Abstract



We address the problem of general supervised learning when data can only be ac- 
cessed through an (indefinite) similarity function between data points. Existing 
work on learning with indefinite kernels has concentrated solely on binary/multi- 
class classification problems. We propose a model that is generic enough to handle 
any supervised learning task and also subsumes the model previously proposed for 
classification. We give a "goodness" criterion for similarity functions w.r.t. a given 
supervised learning task and then adapt a well-known landmarking technique to 
provide efficient algorithms for supervised learning using "good" similarity func- 
tions. We demonstrate the effectiveness of our model on three important super- 
vised learning problems: a) real-valued regression, b) ordinal regression and c) 
ranking where we show that our method guarantees bounded generalization error. 
Furthermore, for the case of real-valued regression, we give a natural goodness 
definition that, when used in conjunction with a recent result in sparse vector re- 
covery, guarantees a sparse predictor with bounded generalization error. Finally, 
we report results of our learning algorithms on regression and ordinal regression 
tasks using non-PSD similarity functions and demonstrate the effectiveness of 
our algorithms, especially that of the sparse landmark selection algorithm that 
achieves significantly higher accuracies than the baseline methods while offering 
reduced computational costs. 

1 Introduction 

The goal of this paper is to develop an extended framework for supervised learning with similarity 
functions. Kernel learning algorithms [1] have become the mainstay of discriminative learning, with
a great deal of effort having been invested by both theoreticians and practitioners.
However, these algorithms typically require the similarity function to be a positive
semi-definite (PSD) function, which can be a limiting factor for several applications, for three reasons:
1) Mercer's condition is a formal statement that is hard to verify, 2) several natural notions of
similarity that arise in practical scenarios are not PSD, and 3) it is not clear why an artificial
constraint like PSD-ness should limit the usability of a kernel.

Several recent papers have demonstrated that indefinite similarity functions can indeed be successfully
used for learning [2, 3, 4, 5]. However, most of the existing work focuses on classification tasks
and provides specialized techniques for the same, albeit with little or no theoretical guarantees. A
notable exception is the line of work by [6, 7, 8] that defines a goodness criterion for a similarity
function and then provides an algorithm that can exploit this goodness criterion to obtain provably
accurate classifiers. However, their definitions are yet again restricted to the problem of classification,
as they take a "margin" based view of the problem that requires positive points to be more
similar to positive points than to negative points by at least a constant margin.

In this work, we instead take a "target-value" point of view and require that target values of similar 
points be similar. Using this view, we propose a generic goodness definition that also admits the 
goodness definition of [6] for classification as a special case. Furthermore, our definition can be seen
as imposing the existence of a smooth function over a generic space defined by similarity functions, 
rather than over a Hilbert space as required by typical goodness definitions of PSD kernels. 

We then adapt the landmarking technique of [6] to provide an efficient algorithm that reduces learning
tasks to corresponding learning problems over a linear space. The main technical challenge at
this stage is to show that such reductions are able to provide good generalization error bounds for
the learning tasks at hand. To this end, we consider three specific problems: a) regression, b) ordinal
regression, and c) ranking. For each problem, we define appropriate surrogate loss functions, and
show that our algorithm is able to, for each specific learning task, guarantee bounded generalization
error with polynomial sample complexity. Moreover, by adapting a general framework given by
[9], we show that these guarantees do not require the goodness definition to be overly restrictive by
showing that our definitions admit all good PSD kernels as well.

For the problem of real-valued regression, we additionally provide a goodness definition that cap- 
tures the intuition that usually, only a small number of landmarks are influential w.r.t. the learning 
task. However, to recover these landmarks, the uniform sampling technique would require sampling 
a large number of landmarks thus increasing the training/test time of the predictor. We address this 
issue by applying a sparse vector recovery algorithm given by [10] and show that the resulting sparse
predictor still has bounded generalization error. 

We also address an important issue faced by algorithms that use landmarking as a feature construction
step, viz. [6, 7, 8], namely that they typically assume separate landmark and training sets for ease
of analysis. In practice, however, one usually tries to overcome paucity of training data by reusing
training data as landmark points as well. We use an argument outlined in [11] to theoretically justify
such "double dipping" in our case. The details of the argument are given in Appendix B.



We perform several experiments on benchmark datasets that demonstrate significant performance 
gains for our methods over the baseline of kernel regression. Our sparse landmark selection tech- 
nique provides significantly better predictors that are also more efficient at test time. 

Related Work: Existing approaches to extend kernel learning algorithms to indefinite kernels can
be classified into three broad categories: a) those that use indefinite kernels directly with existing
kernel learning algorithms, resulting in non-convex formulations [2, 3]; b) those that convert a given
indefinite kernel into a PSD one by either projecting onto the PSD cone [4, 5] or performing other
spectral operations [12]. The second approach is usually expensive due to the spectral operations
involved, apart from making the method inherently transductive. Moreover, any domain knowledge
stored in the original kernel is lost due to these task-oblivious operations and consequently, no
generalization guarantees can be given; c) those that use notions of "task-kernel alignment" or,
equivalently, notions of "goodness" of a kernel, to give learning algorithms [6, 7, 8]. This approach
enjoys several advantages over the other approaches listed above. These models are able to use
the indefinite kernel directly with existing PSD kernel learning techniques, all the while retaining
the ability to give generalization bounds that quantitatively parallel those of PSD kernel learning
models. In this paper, we adopt the third approach for the general supervised learning problem.



2 Problem formulation and Preliminaries 



The goal in similarity-based supervised learning is to closely approximate a target predictor
$y : \mathcal{X} \to \mathcal{Y}$ over some domain $\mathcal{X}$ using a hypothesis $\hat{f}(\cdot\,; K) : \mathcal{X} \to \mathcal{Y}$ that restricts its interaction
with data points to computing similarity values given by $K$. Now, if the similarity function $K$ is
not discriminative enough for the given task then we cannot hope to construct a predictor out of it 
that enjoys good generalization properties. Hence, it is natural to define the "goodness" of a given 
similarity function with respect to the learning task at hand. 

Definition 1 (Good similarity function: preliminary). Given a learning task $y : \mathcal{X} \to \mathcal{Y}$ over some
distribution $\mathcal{D}$, a similarity function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be $(\epsilon_0, B)$-good with respect to
this task if there exists some bounded weighing function $w : \mathcal{X} \to [-B, B]$ such that for at least a
$(1 - \epsilon_0)$ $\mathcal{D}$-fraction of the domain, we have $y(x) = \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')\, y(x')\, K(x, x')\right]$.

The above definition is inspired by the definition of a "good" similarity function with respect to 
classification tasks given in [6]. However, their definition is tied to class labels and thus applies only
Algorithm 1 Supervised learning with Similarity functions

Input: A target predictor $y : \mathcal{X} \to \mathcal{Y}$ over a distribution $\mathcal{D}$, an $(\epsilon_0, B)$-good similarity function $K$, labeled
training points sampled from $\mathcal{D}$: $T = \{(x_1^t, y_1), \ldots, (x_n^t, y_n)\}$, loss function $\ell_S : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}^+$.
Output: A predictor $\hat{f} : \mathcal{X} \to \mathbb{R}$ with bounded true loss over $\mathcal{D}$
1: Sample $d$ unlabeled landmarks from $\mathcal{D}$: $\mathcal{L} = \{x_1^l, \ldots, x_d^l\}$
   // Else subsample $d$ landmarks from $T$ (see Appendix B for details)
2: $\Psi_{\mathcal{L}} : x \mapsto \frac{1}{\sqrt{d}}\left(K(x, x_1^l), \ldots, K(x, x_d^l)\right) \in \mathbb{R}^d$
3: $\hat{w} = \arg\min_{w \in \mathbb{R}^d : \|w\|_2 \le B} \sum_{i=1}^n \ell_S\left(\langle w, \Psi_{\mathcal{L}}(x_i^t) \rangle, y_i\right)$
4: return $\hat{f} : x \mapsto \langle \hat{w}, \Psi_{\mathcal{L}}(x) \rangle$
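The steps of Algorithm 1 can be illustrated with a minimal Python sketch. Everything below is illustrative rather than the authors' implementation: the function names are hypothetical, the loss is passed in through its derivative with respect to the prediction, and the constrained ERM of Step 3 is solved by projected (sub)gradient descent, which is a choice made for this sketch only.

```python
import numpy as np

def landmark_map(K, landmarks):
    """Step 2: build Psi, x -> (1/sqrt(d)) * (K(x, l_1), ..., K(x, l_d))."""
    d = len(landmarks)

    def psi(X):
        X = np.atleast_2d(X)
        feats = np.array([[K(x, l) for l in landmarks] for x in X])
        return feats / np.sqrt(d)

    return psi

def project_l2_ball(w, B):
    """Euclidean projection onto the ball {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def fit_landmark_predictor(K, X_train, y_train, landmarks, B, loss_grad,
                           steps=500, lr=0.1):
    """Step 3: constrained ERM in the landmarked space.

    loss_grad(pred, y) is the (sub)derivative of the loss w.r.t. pred.
    The solver (projected gradient descent) is an assumption of this sketch.
    """
    psi = landmark_map(K, landmarks)
    Phi = psi(X_train)
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ loss_grad(Phi @ w, y_train) / len(y_train)
        w = project_l2_ball(w - lr * grad, B)
    # Step 4: the returned predictor touches data only through K.
    return (lambda X: psi(X) @ w), w
```

For example, passing `loss_grad=lambda p, y: p - y` corresponds to a squared-loss surrogate; a subgradient of the chosen surrogate loss can be supplied in the same way.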



to classification tasks. Similar to [6], the above definition calls a similarity function $K$ "good" if the
target value $y(x)$ of a given point $x$ can be approximated in terms of (a weighted combination of)
the target values of the $K$-"neighbors" of $x$. Also, note that this definition automatically enforces a
smoothness prior on the framework.

However the above definition is too rigid. Moreover, it defines goodness in terms of violations, a 
non-convex loss function. To remedy this, we propose an alternative definition that incorporates an 
arbitrary (but in practice always convex) loss function. 

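Definition 2 (Good similarity function: final). Given a learning task $y : \mathcal{X} \to \mathcal{Y}$ over some
distribution $\mathcal{D}$, a similarity function $K$ is said to be $(\epsilon_0, B)$-good with respect to a loss function
$\ell_S : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ if there exists some bounded weighing function $w : \mathcal{X} \to [-B, B]$ such that if we
define a predictor as $f(x) := \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x') K(x, x')\right]$, then we have $\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_S(f(x), y(x))\right] \le \epsilon_0$.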

Note that Definition 2 reduces to Definition 1 for $\ell_S(a, b) = \mathbb{1}\{a \ne b\}$. Moreover, for the case of
binary classification where $y \in \{-1, +1\}$, if we take $\ell_S(a, b) = \mathbb{1}\{ab \le B\gamma\}$, then we recover the
$(\epsilon_0, \gamma)$-goodness definition of a similarity function, given in Definition 3 of [6]. Also note that,
assuming $\sup\{|y(x)|\} < \infty$, we can w.l.o.g. merge $w(x')y(x')$ into a single term $w(x')$.

Having given this definition we must make sure that "good" similarity functions allow the construc- 
tion of effective predictors (Utility property). Moreover, we must make sure that the definition does 
not exclude commonly used PSD kernels (Admissibility property). Below, we formally define these 
two properties and in later sections, show that for each of the learning tasks considered, our goodness 
definition satisfies these two properties. 



2.1 Utility 



Definition 3 (Utility). A similarity function $K$ is said to be $\epsilon_0$-useful w.r.t. a loss function $\ell_{\text{actual}}(\cdot, \cdot)$
if the following holds: there exists a learning algorithm $\mathcal{A}$ that, for any $\epsilon_1, \delta > 0$, when given
$\mathrm{poly}(1/\epsilon_1, \log(1/\delta))$ "labeled" and "unlabeled" samples from the input distribution $\mathcal{D}$, with probability
at least $1 - \delta$, generates a hypothesis $\hat{f}(x; K)$ s.t. $\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{\text{actual}}(\hat{f}(x), y(x))\right] \le \epsilon_0 + \epsilon_1$.
Note that $\hat{f}(x; K)$ is restricted to access the data solely through $K$.



Here, the $\epsilon_0$ term captures the misfit or the bias of the similarity function with respect to the learning
problem. Notice that the above utility definition allows for learning from unlabeled data points and 
thus puts our approach in the semi-supervised learning framework. 

All our utility guarantees proceed by first using unlabeled samples as landmarks to construct a landmarked
space. Next, using the goodness definition, we show the existence of a good linear predictor
in the landmarked space. This guarantee is obtained in two steps as outlined in Algorithm 1: first,
we choose $d$ unlabeled landmark points and construct a map $\Psi : \mathcal{X} \to \mathbb{R}^d$ (see Step 2 of Algorithm 1)
and show that there exists a linear predictor over $\mathbb{R}^d$ that closely approximates the predictor
$f$ used in Definition 2 (see Lemma 15 in Appendix A). In the second step, we learn a predictor (over
the landmarked space) using ERM over a fresh labeled training set (see Step 3 of Algorithm 1). We
then use individual task-specific arguments and Rademacher average-based generalization bounds
[13], thus proving the utility of the similarity function.
2.2 Admissibility



In order to show that our models are not too rigid, we would prove that they admit good PSD 
kernels. The notion of a good PSD kernel for us will be one that corresponds to a prevalent large 
margin technique for the given problem. In general, most notions correspond to the existence of a 
linear operator in the RKHS of the kernel that has small loss at large margin. More formally, 
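Definition 4 (Good PSD Kernel). Given a learning task $y : \mathcal{X} \to \mathcal{Y}$ over some distribution $\mathcal{D}$, a
PSD kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with associated RKHS $\mathcal{H}_K$ and canonical feature map $\Phi_K : \mathcal{X} \to \mathcal{H}_K$
is said to be $(\epsilon_0, \gamma)$-good with respect to a loss function $\ell_K : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ if there exists $W^* \in \mathcal{H}_K$
such that $\|W^*\| = 1$ and

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_K\left(\frac{\langle W^*, \Phi_K(x) \rangle}{\gamma},\, y(x)\right)\right] < \epsilon_0.$$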



We will show, for all the learning tasks considered, that every $(\epsilon_0, \gamma)$-good PSD kernel, when treated
as simply a similarity function with no consideration of its RKHS, is also $(\epsilon_0 + \epsilon_1, B)$-good for
arbitrarily small $\epsilon_1$ with $B = h(\gamma, \epsilon_1)$ for some function $h$. To prove these results we will adapt
techniques introduced in [9] with certain modifications and task-dependent arguments.



3 Applications 

We will now instantiate the general learning model described above to real-valued regression, ordinal
regression and ranking by providing utility and admissibility guarantees. Due to lack of space, we
relegate all proofs as well as the discussion on ranking to the supplementary material (Appendix F).

3.1 Real-valued Regression

Real-valued regression is a quintessential learning problem [1] that has received a lot of attention
in the learning literature. In the following we shall present algorithms for performing real-valued
regression using non-PSD similarity measures. We consider the problem with $\ell_{\text{actual}}(a, b) = |a - b|$
as the true loss function. For the surrogates $\ell_S$ and $\ell_K$, we choose the $\epsilon$-insensitive loss function [1]
defined as follows:

$$\ell_\epsilon(a, b) = \begin{cases} 0, & \text{if } |a - b| < \epsilon, \\ |a - b| - \epsilon, & \text{otherwise.} \end{cases}$$

The above loss function automatically gives us notions of good kernels and similarity functions by
appealing to Definitions 4 and 2 respectively. It is easy to transfer error bounds in terms of absolute
error to those in terms of mean squared error (MSE), a commonly used performance measure for
real-valued regression. See Appendix D for further discussion on the choice of the loss function.
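As a small illustration, the $\epsilon$-insensitive loss can be written in a few lines of Python (the function name is hypothetical). The sketch also makes visible the elementary inequality $|a - b| \le \ell_\epsilon(a, b) + \epsilon$, which is one way to see why a bound on the surrogate loss translates into a bound on the absolute loss with an additive $\epsilon$.

```python
import numpy as np

def eps_insensitive(a, b, eps):
    """Elementwise eps-insensitive loss: 0 inside the eps-tube, |a - b| - eps outside."""
    gap = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return np.maximum(gap - eps, 0.0)
```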



Using the landmarking strategy described in Section 2.1, we can reduce the problem of real regression
to that of a linear regression problem in the landmarked space. More specifically, the ERM step
in Algorithm 1 becomes the following: $\arg\min_{w \in \mathbb{R}^d : \|w\|_2 \le B} \sum_{i=1}^n \ell_\epsilon\left(\langle w, \Psi_{\mathcal{L}}(x_i) \rangle, y_i\right)$.

There exist solvers (for instance [14]) to efficiently solve the above problem on linear spaces. Using
proof techniques sketched in Section 2.1 along with specific arguments for the $\epsilon$-insensitive loss, we
can prove generalization guarantees and hence utility guarantees for the similarity function.

Theorem 5. Every similarity function that is $(\epsilon_0, B)$-good for a regression problem with respect
to the $\epsilon$-insensitive loss function $\ell_\epsilon(\cdot, \cdot)$ is $(\epsilon_0 + \epsilon)$-useful with respect to absolute loss as well as
$(B\epsilon_0 + B\epsilon)$-useful with respect to mean squared error. Moreover, both the dimensionality of the
landmarked space as well as the labeled sample complexity can be bounded by $O\left(\frac{B^2}{\epsilon_1^2} \log \frac{1}{\delta}\right)$.

We are also able to prove the following (tight) admissibility result:

Theorem 6. Every PSD kernel that is $(\epsilon_0, \gamma)$-good for a regression problem is, for any $\epsilon_1 > 0$,
$\left(\epsilon_0 + \epsilon_1, O\left(\frac{1}{\epsilon_1 \gamma^2}\right)\right)$-good as a similarity function as well. Moreover, for any $\epsilon_1 < 1/2$ and any
$\gamma < 1$, there exists a regression instance and a corresponding kernel that is $(0, \gamma)$-good for the
regression problem but only $(\epsilon_1, B)$-good as a similarity function for $B = \Omega\left(\frac{1}{\epsilon_1 \gamma^2}\right)$.
3.2 Sparse regression models



An artifact of a random choice of landmarks is that very few of them might turn out to be "informa- 
tive" with respect to the prediction problem at hand. For instance, in a network, there might exist 
hubs or authoritative nodes that yield rich information about the learning problem. If the relative 
abundance of such nodes is low then random selection would compel us to choose a large number 
of landmarks before enough "informative" ones have been collected. 

However this greatly increases training and testing times due to the increased costs of constructing 
the landmarked space. Thus, the ability to prune away irrelevant landmarks would speed up training 
and test routines. We note that this issue has been addressed before in literature [8, 12] by way
of landmark selection heuristics. In contrast, we guarantee that our predictor will select a small 
number of landmarks while incurring bounded generalization error. However this requires a careful 
restructuring of the learning model to incorporate the "informativeness" of landmarks. 
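
Definition 7. A similarity function $K$ is said to be $(\epsilon_0, B, \tau)$-good for a real-valued regression
problem $y : \mathcal{X} \to \mathbb{R}$ if for some bounded weight function $w : \mathcal{X} \to [-B, B]$ and choice function
$R : \mathcal{X} \to \{0, 1\}$ with $\mathbb{E}_{x \sim \mathcal{D}}[R(x)] = \tau$, the predictor $f : x \mapsto \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x') K(x, x') \mid R(x')\right]$ has
bounded $\epsilon$-insensitive loss, i.e. $\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon(f(x), y(x))\right] < \epsilon_0$.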

The role of the choice function is to single out informative landmarks, while $\tau$ specifies the relative
density of informative landmarks. Note that the above definition is similar in spirit to the goodness
definition presented in [15]. While the motivation behind [15] was to give an improved admissibility
result for binary classification, we squarely focus on the utility guarantees, with the aim of
accelerating our learning algorithms via landmark pruning.



We prove the utility guarantee in three steps as outlined in Appendix D. First, we use the usual
landmarking step to project the problem onto a linear space. This step guarantees the following:

Theorem 8. Given a similarity function that is $(\epsilon_0, B, \tau)$-good for a regression problem, there exists
a randomized map $\Psi : \mathcal{X} \to \mathbb{R}^d$ for $d = O\left(\frac{B^2}{\tau \epsilon_1^2} \log \frac{1}{\delta}\right)$ such that with probability at least $1 - \delta$,
there exists a linear operator $\tilde{f} : x \mapsto \langle w, x \rangle$ over $\mathbb{R}^d$ such that $\|w\|_1 \le B$ with $\epsilon$-insensitive loss
bounded by $\epsilon_0 + \epsilon_1$. Moreover, with the same confidence we have $\|w\|_0 \le \frac{3 d \tau}{2}$.
Our proof follows that of [15]; however, we additionally prove sparsity of $w$ as well. The number of
landmarks required here is an $\Omega(1/\tau)$ factor greater than that required by Theorem 5. This formally
captures the intuition presented earlier of a small fraction of dimensions (read landmarks) being actually
relevant to the learning problem. So, in the second step, we use the Forward Greedy Selection
algorithm given in [10] to learn a sparse predictor. The use of this learning algorithm necessitates
the use of a different generalization bound in the final step to complete the utility guarantee given
below. We refer the reader to Appendix D for the details of the algorithm and its utility analysis.

Theorem 9. Every similarity function that is $(\epsilon_0, B, \tau)$-good for a regression problem with respect
to the $\epsilon$-insensitive loss function $\ell_\epsilon(\cdot, \cdot)$ is $(\epsilon_0 + \epsilon)$-useful with respect to absolute loss as well, with the
dimensionality of the landmarked space being bounded by $O\left(\frac{B^2}{\tau \epsilon_1^2} \log \frac{1}{\delta}\right)$ and the labeled sample
complexity being bounded by $O\left(\frac{B^2}{\epsilon_1^2} \log \frac{B}{\epsilon_1 \delta}\right)$. Moreover, this utility can be achieved by an $O(\tau)$-sparse
predictor on the landmarked space.

We note that the improvements obtained here by using the sparse learning methods of [10] provide
an $\Omega(\tau)$ increase in sparsity. We now prove admissibility results for this sparse learning model. We
do this by showing that the dense model analyzed in Theorem 5 and that given in Definition 7 are
interpretable in each other for an appropriate selection of parameters. The guarantees in Theorem 6
can then be invoked to conclude the admissibility proof.

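Theorem 10. Every $(\epsilon_0, B)$-good similarity function $K$ is also $\left(\epsilon_0, B, \frac{\bar{w}}{B}\right)$-good where $\bar{w} =
\mathbb{E}_{x \sim \mathcal{D}}\left[|w(x)|\right]$. Moreover, every $(\epsilon_0, B, \tau)$-good similarity function $K$ is also $(\epsilon_0, B/\tau)$-good.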
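
One way to see the second claim of Theorem 10 is the following short calculation. It is only a sketch, not the proof given in the supplementary material, and it assumes that the conditioning in the sparse goodness definition means conditioning on the event $R(x') = 1$, which has probability $\tau$; the rescaled weight $\tilde{w}$ is notation introduced here.

```latex
\begin{align*}
f(x) &= \mathbb{E}_{x' \sim \mathcal{D}}\left[ w(x')\, K(x, x') \,\middle|\, R(x') = 1 \right]
      = \frac{\mathbb{E}_{x' \sim \mathcal{D}}\left[ w(x')\, R(x')\, K(x, x') \right]}{\Pr_{x' \sim \mathcal{D}}\left[ R(x') = 1 \right]} \\
     &= \mathbb{E}_{x' \sim \mathcal{D}}\left[ \tilde{w}(x')\, K(x, x') \right],
      \qquad \tilde{w}(x') := \frac{w(x')\, R(x')}{\tau},
      \qquad |\tilde{w}(x')| \le \frac{B}{\tau}.
\end{align*}
```

The predictor $f$ is unchanged, so its expected $\epsilon$-insensitive loss is unchanged as well, while the weight function is now bounded by $B/\tau$.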

Using Theorem 6, we immediately have the following corollary:

Corollary 11. Every PSD kernel that is $(\epsilon_0, \gamma)$-good for a regression problem is, for any $\epsilon_1 > 0$,
$\left(\epsilon_0 + \epsilon_1, O\left(\frac{1}{\epsilon_1 \gamma^2}\right), 1\right)$-good as a similarity function as well.
3.3 Ordinal Regression



The problem of ordinal regression requires an accurate prediction of (discrete) labels coming from 
a finite ordered set [r] = {1,2,..., r}. The problem is similar to both classification and regression, 
but has some distinct features due to which it has received independent attention [16, 17] in domains
such as product ratings etc. The most popular performance measure for this problem is the absolute 
loss which is the absolute difference between the predicted and the true labels. 

A natural and rather tempting way to solve this problem is to relax the problem to real-valued 
regression and threshold the output of the learned real-valued predictor using predefined thresholds 
$b_1, \ldots, b_r$ to get discrete labels. Although this approach has been prevalent in literature [17], as the
discussion in the supplementary material shows, this leads to poor generalization guarantees in our
model. More specifically, a goodness definition constructed around such a direct reduction is only
able to ensure $(\epsilon_0 + 1)$-utility, i.e. the absolute error rate is always greater than 1.

One of the reasons for this is the presence of the thresholding operation that makes it impossible to 
distinguish between instances that would not be affected by small perturbations to the underlying 
real-valued predictor and those that would. To remedy this, we enforce a (soft) margin with respect 
to thresholding that makes the formulation more robust to noise. More formally, we expect that if 
a point belongs to the label $i$, then in addition to being sandwiched between the thresholds $b_i$ and
$b_{i+1}$, it should be separated from these by a margin as well, i.e. $b_i + \gamma \le f(x) \le b_{i+1} - \gamma$.

This is a direct generalization of the margin principle in classification where we expect $w^\top x > b + \gamma$
for positively labeled points and $w^\top x < b - \gamma$ for negatively labeled points. Of course, whereas
classification requires a single threshold, we require several, depending upon the number of labels.
For any $x \in \mathbb{R}$, let $[x]_+ = \max\{x, 0\}$. Thus, if we define the $\gamma$-margin loss function to be $[x]_\gamma :=
[\gamma - x]_+$ (note that this is simply the well known hinge loss function scaled by a factor of $\gamma$), we
can define our goodness criterion as follows:

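Definition 12. A similarity function $K$ is said to be $(\epsilon_0, B)$-good for an ordinal regression problem
$y : \mathcal{X} \to [r]$ if for some bounded weight function $w : \mathcal{X} \to [-B, B]$ and some (unknown but fixed)
set of thresholds $\{b_i\}_{i=1}^r$ with $b_1 = -\infty$, the predictor $f : x \mapsto \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x') K(x, x')\right]$ satisfies

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[f(x) - b_{y(x)}\right]_\gamma + \left[b_{y(x)+1} - f(x)\right]_\gamma\right] < \epsilon_0.$$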
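
To make the loss inside this expectation concrete, here is a small Python sketch of the two-sided margin loss for a single point, as we read it from the definition above. The function names are hypothetical, and the convention that the threshold list carries a final entry $b_{r+1} = +\infty$ is an assumption of this sketch (the text only fixes $b_1 = -\infty$).

```python
import numpy as np

def margin_loss(x, gamma):
    """gamma-margin loss [x]_gamma = max(gamma - x, 0)."""
    return np.maximum(gamma - x, 0.0)

def ordinal_margin_loss(f_x, label, thresholds, gamma):
    """[f(x) - b_y]_gamma + [b_{y+1} - f(x)]_gamma for a label y in {1, ..., r}.

    thresholds = [b_1, ..., b_{r+1}] with b_1 = -inf; the last entry
    b_{r+1} = +inf is an assumption of this sketch.
    """
    lower = thresholds[label - 1]   # b_y
    upper = thresholds[label]       # b_{y+1}
    return margin_loss(f_x - lower, gamma) + margin_loss(upper - f_x, gamma)
```

A prediction that lies between its two thresholds and at least $\gamma$ away from both incurs zero loss; a prediction sitting exactly on a threshold already pays $\gamma$.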



We now give utility guarantees for our learning model. We shall give guarantees on both the misclassification
error as well as the absolute error of our learned predictor. We say that a set of points
$x_1, \ldots, x_i, \ldots$ is $\Delta$-spaced if $\min_{i \ne j}\{|x_i - x_j|\} \ge \Delta$. Define the function $\psi_\Delta(x) = \frac{x + \Delta - 1}{\Delta}$.

Theorem 13. Let $K$ be a similarity function that is $(\epsilon_0, B)$-good for an ordinal regression problem
with respect to $\Delta$-spaced thresholds and $\gamma$-margin loss. Let $\bar{\gamma} = \max\{\gamma, 1\}$. Then $K$ is
$\psi_{(\Delta/\bar{\gamma})}\left(\frac{\epsilon_0}{\bar{\gamma}}\right)$-useful with respect to ordinal regression error (absolute loss). Moreover, $K$ is $\left(\frac{\epsilon_0}{\bar{\gamma}}\right)$-useful
with respect to the zero-one mislabeling error as well.

We can bound both the dimensionality of the landmarked space and the labeled sample complexity
by $O\left(\frac{B^2}{\epsilon_1^2} \log \frac{1}{\delta}\right)$. Notice that for $\epsilon_0 < 1$ and large enough $d, n$, we can ensure that the ordinal
regression error rate is also bounded above by 1 since $\sup_{x \in [0,1], \Delta > 0} \psi_\Delta(x) = 1$. This is in contrast
with the direct reduction to real valued regression which has ordinal regression error rate bounded
below by 1. This indicates the advantage of the present model over a naive reduction to regression.

We can show that our definition of a good similarity function admits all good PSD kernels as well. 
The kernel goodness criterion we adopt corresponds to the large margin framework proposed by 
[16]. We refer the reader to Appendix E.3 for the definition and give the admissibility result below.



Theorem 14. Every PSD kernel that is $(\epsilon_0, \gamma)$-good for an ordinal regression problem is also
$\left(\gamma_1 \epsilon_0 + \epsilon_1, O\left(\frac{\gamma_1^2}{\epsilon_1 \gamma^2}\right)\right)$-good as a similarity function with respect to the $\gamma_1$-margin loss for any
$\gamma_1, \epsilon_1 > 0$. Moreover, for any $\epsilon_1 < \gamma_1/2$, there exists an ordinal regression instance and a corresponding
kernel that is $(0, \gamma)$-good for the ordinal regression problem but only $(\epsilon_1, B)$-good as a
similarity function with respect to the $\gamma_1$-margin loss function for $B = \Omega\left(\frac{\gamma_1^2}{\epsilon_1 \gamma^2}\right)$.
(b) Avg. absolute error for landmarking (ORLand) and kernel regression (KR) on ordinal regression datasets

Figure 1: Performance of landmarking algorithms with increasing number of landmarks on real-valued
regression (Figure 1a) and ordinal regression (Figure 1b) datasets.



Dataset [source], N, d | Sigmoid kernel: KR | Sigmoid kernel: Land-Sp | Manhattan kernel: KR | Manhattan kernel: Land-Sp
Abalone [18], N = 4177, d = 8 | 2.1e-002 (8.3e-004) | 6.2e-003 (8.4e-004) | 1.7e-002 (7.1e-004) | 6.0e-003 (3.7e-004)
Bodyfat [19], N = 252, d = 14 | 4.6e-004 (6.5e-005) | 9.5e-005 (1.3e-004) | 3.9e-004 (2.2e-005) | 3.5e-005 (1.3e-005)
CAHousing [19], N = 20640, d = 8 | 5.9e-002 (2.3e-004) | 1.6e-002 (6.2e-004) | 5.8e-002 (1.9e-004) | 1.5e-002 (1.4e-004)
CPUData [20], N = 8192, d = 12 | 4.1e-002 (1.6e-003) | 1.4e-003 (1.7e-004) | 4.3e-002 (1.6e-003) | 1.2e-003 (3.2e-005)
PumaDyn-8 [20], N = 8192, d = 8 | 2.3e-001 (4.6e-003) | 1.4e-002 (4.5e-004) | 2.3e-001 (4.5e-003) | 1.4e-002 (4.8e-004)
PumaDyn-32 [20], N = 8192, d = 32 | 1.8e-001 (3.6e-003) | 1.4e-002 (3.7e-004) | 1.8e-001 (3.6e-003) | 1.4e-002 (3.1e-004)

(a) Mean squared error for real regression



Dataset [source], N, d | Sigmoid kernel: KR | Sigmoid kernel: ORLand | Manhattan kernel: KR | Manhattan kernel: ORLand
Wine-Red [18], N = 1599, d = 11 | 6.8e-001 (2.8e-002) | 4.2e-001 (3.8e-002) | 6.7e-001 (3.0e-002) | 4.5e-001 (3.2e-002)
Wine-White [18], N = 4898, d = 11 | 6.2e-001 (2.0e-002) | 8.9e-001 (8.5e-001) | 6.2e-001 (2.0e-002) | 4.9e-001 (1.5e-002)
Bank-8 [20], N = 8192, d = 8 | 2.9e+000 (6.2e-002) | 6.1e-001 (4.4e-002) | 2.7e+000 (6.6e-002) | 6.3e-001 (1.7e-002)
Bank-32 [20], N = 8192, d = 32 | 2.7e+000 (1.2e-001) | 1.6e+000 (2.3e-002) | 2.6e+000 (8.1e-002) | 1.6e+000 (9.4e-002)
House-8 [20], N = 22784, d = 8 | 2.8e+000 (9.3e-003) | 1.5e+000 (2.0e-002) | 2.7e+000 (1.0e-002) | 1.4e+000 (1.2e-002)
House-16 [20], N = 22784, d = 16 | 2.7e+000 (2.0e-002) | 1.5e+000 (1.0e-002) | 2.8e+000 (2.0e-002) | 1.4e+000 (2.3e-002)

(b) Mean absolute error for ordinal regression



Table 1: Performance of landmarking-based algorithms (with 50 landmarks) vs. baseline kernel 
regression (KR). Values in parentheses indicate standard deviation values. Values in the first columns 
indicate dataset source (in parentheses), size (N) and dimensionality (d). 



Due to lack of space we refer the reader to Appendix F for a discussion on ranking models that
includes utility and admissibility guarantees with respect to the popular NDCG loss.



4 Experimental Results 

In this section we present an empirical evaluation of our learning models for the problems of real-valued
regression and ordinal regression on benchmark datasets taken from a variety of sources
[18, 19, 20]. In all cases, we compare our algorithms against kernel regression (KR), a well known
technique [21] for non-linear regression, whose predictor is of the form:

$$f : x \mapsto \frac{\sum_{x_i \in T} y(x_i)\, K(x, x_i)}{\sum_{x_i \in T} K(x, x_i)}$$

where T is the training set. We selected KR as the baseline as it is a popular regression method that 
does not require similarity functions to be PSD. For ordinal regression problems, we rounded off the 
result of the KR predictor to get a discrete label. We implemented all our algorithms as well as the 
baseline KR method in Matlab. In all our experiments we report results across 5 random splits on
the (indefinite) Sigmoid: $K(x, y) = \tanh(a \langle x, y \rangle + r)$ and Manhattan: $K(x, y) = -\|x - y\|_1$
kernels. Following standard practice, we fixed $r = -1$ and $a = 1/d_{\text{orig}}$ for the Sigmoid kernel
where $d_{\text{orig}}$ is the dimensionality of the dataset.
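
For concreteness, the two similarity functions and the KR baseline can be sketched in Python as follows. The experiments in the paper were run in Matlab, so this is only an illustration with hypothetical function names; the predictor assumes the standard weighted-average (Nadaraya-Watson) form of kernel regression.

```python
import numpy as np

def sigmoid_kernel(x, y, a, r=-1.0):
    """Sigmoid similarity tanh(a * <x, y> + r); the text uses a = 1/d_orig, r = -1."""
    return np.tanh(a * np.dot(x, y) + r)

def manhattan_kernel(x, y):
    """Manhattan similarity -||x - y||_1."""
    return -np.sum(np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

def kernel_regression(K, X_train, y_train, x):
    """Weighted average of training targets with weights K(x, x_i).

    With indefinite similarities the weights can be negative and the
    normalizer can vanish, so a zero denominator is reported explicitly.
    """
    weights = np.array([K(x, xi) for xi in X_train])
    denom = weights.sum()
    if denom == 0.0:
        raise ZeroDivisionError("similarity weights sum to zero")
    return float(weights @ np.asarray(y_train, dtype=float) / denom)
```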

Real valued regression: For this experiment, we compare our methods (RegLand and RegLand-Sp)
with the KR method. For RegLand, we constructed the landmarked space as specified in Algorithm 1
and learned a linear predictor using the LIBLINEAR package [14] that minimizes $\epsilon$-insensitive
loss. In the second algorithm (RegLand-Sp), we used the sparse learning algorithm of [10] on the
landmarked space to learn the best predictor for a given sparsity level. Due to its simplicity and
good convergence properties, we implemented the Fully Corrective version of the Forward Greedy
Selection algorithm with squared loss as the surrogate.
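
The following Python sketch shows one common form of fully corrective forward greedy selection with squared loss. It is our reading of the description above rather than the authors' code: the function name, the choice of the coordinate with the largest gradient magnitude, and the stopping tolerance are assumptions, and the exact variant in [10] may differ.

```python
import numpy as np

def forward_greedy_selection(Phi, y, k, tol=1e-12):
    """Select at most k columns (landmarks) of Phi for a sparse linear predictor.

    Each round adds the coordinate with the largest squared-loss gradient
    magnitude, then refits least squares on all selected coordinates
    (the 'fully corrective' step).
    """
    n, d = Phi.shape
    support = []
    w = np.zeros(d)
    for _ in range(k):
        residual = Phi @ w - y
        grad = Phi.T @ residual / n          # gradient of 0.5 * mean squared error
        if support:
            grad[support] = 0.0              # never re-select a chosen landmark
        j = int(np.argmax(np.abs(grad)))
        if abs(grad[j]) < tol:               # nothing left to gain
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        w = np.zeros(d)
        w[support] = coef
    return w, support
```

Here `Phi` would be the matrix of landmarked features of the training points, so each selected column corresponds to one retained landmark, and only those similarities need to be computed at test time.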



We evaluated all methods using Mean Squared Error (MSE) on the test set. Figure 1a shows the MSE
incurred by our methods along with reference values of accuracies obtained by KR as landmark sizes
increase. The plots clearly show that our methods incur significantly lower error than KR. Moreover,
RegLand-Sp learns more accurate predictors using the same number of landmarks. For instance,
when learning using the Sigmoid kernel on the CPUData dataset, at 20 landmarks, RegLand is able
to guarantee an MSE of 0.016 whereas RegLand-Sp offers an MSE of less than 0.02; MLKR is
only able to guarantee an MSE rate of 0.04 for this dataset. In Table 1a we compare accuracies of
the two algorithms when given 50 landmark points with those of KR for the Sigmoid and Manhattan
kernels. We find that in all cases, RegLand-Sp gives better accuracy than KR. Moreover, the
Manhattan kernel seems to match or outperform the Sigmoid kernel on all the datasets.

Ordinal Regression: Here, we compare our method with the baseline KR method on benchmark 
datasets. As mentioned in Section 3.3, our method uses the EXC formulation of [16] along with the 
landmarking scheme given in Algorithm 1. We implemented a gradient descent-based solver (ORLand) 
to solve the primal formulation of EXC and used fixed equi-spaced thresholds instead of 
learning them as suggested by [16]. Of the six datasets considered here, the two Wine datasets are 
ordinal regression datasets where the quality of the wine is to be predicted on a scale from 1 to 10. 
The remaining four datasets are regression datasets whose labels were subjected to equi-frequency 
binning to obtain ordinal regression datasets [16]. We measured the average absolute error (AAE) 
for each method. Figure 1b compares ORLand with KR as the number of landmarks increases. 
Table 1b compares accuracies of ORLand for 50 landmark points with those of KR for Sigmoid and 
Manhattan kernels. In almost all cases, ORLand gives a much better performance than KR. The 
Sigmoid kernel seems to outperform the Manhattan kernel on a couple of datasets. 
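
The two data-handling steps mentioned above, equi-frequency binning of real-valued labels and scoring by AAE, can be sketched as follows. This is an illustrative sketch only; the function names, the tie-handling (ties are split by sort order) and the toy data are assumptions.

```python
def equi_frequency_bins(values, num_bins):
    """Map real-valued labels to ordinal labels 1..num_bins with near-equal bin counts.

    Ties are broken by sort order, so equal values may land in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * num_bins // len(values) + 1
    return labels

def average_absolute_error(predicted, actual):
    """Mean of |predicted label - true label| over all points."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

targets = [0.3, 2.5, 1.1, 4.8, 3.9, 0.7, 5.2, 2.0, 1.6]   # toy regression labels
ordinal = equi_frequency_bins(targets, 3)                  # three ordinal classes
predicted = [1, 2, 2, 3, 2, 1, 3, 2, 3]                    # hypothetical predictions
aae = average_absolute_error(predicted, ordinal)
```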

We refer the reader to Appendix G for additional experimental results. 



5 Conclusion 



In this work we considered the general problem of supervised learning using non-PSD similarity 
functions. We provided a goodness criterion for similarity functions w.r.t. various learning tasks. 
This allowed us to construct efficient learning algorithms with provable generalization error bounds. 
At the same time, we were able to show, for each learning task, that our criterion is not too restrictive 
in that it admits all good PSD kernels. We then focused on the problem of identifying influential 
landmarks with the aim of learning sparse predictors. We presented a model that formalized the 
intuition that typically only a small fraction of landmarks is influential for a given learning problem. 
We adapted existing sparse vector recovery algorithms within our model to learn provably sparse 
predictors with bounded generalization error. Finally, we empirically evaluated our learning algo- 
rithms on benchmark regression and ordinal regression tasks. In all cases, our learning methods, 
especially the sparse recovery algorithm, consistently outperformed the kernel regression baseline. 

An interesting direction for future research would be learning good similarity functions à la metric 
learning or kernel learning. It would also be interesting to conduct large-scale experiments on 
real-world data such as social networks that naturally capture the notion of similarity amongst nodes. 

Supplementary Material 



Throughout this document, theorems and lemmata that were not originally proven as a part of this 
work cite, as a part of their statement, the work that originally presented the proof. 



Appendix A Proofs of supplementary theorems 



In this section we give proofs for certain generic results that will be used in the utility and 
admissibility proofs. The first result, given as Lemma 15, allows us to analyze the landmarking step 
(Step 1 of Algorithm 1) and to reduce the learning problem to that of learning a linear predictor 
over the landmarked space. The second result, given as Lemma 16, gives us a succinct restatement 
of generalization error bounds proven in [13] that will be used in proving utility bounds. The 
third result, given as Lemma 17, is a technical result that helps us prove admissibility bounds for our 
goodness definitions. 

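Lemma 15 (Landmarking approximation guarantee [8]). Given a similarity function $K$ over a domain 
$\mathcal{X}$ and a bounded function of the form $f(x) = \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right]$ 
for some bounded weight function $w : \mathcal{X} \to [-B, B]$, for every $\epsilon, \delta > 0$ there exists a 
randomized map $\Psi : \mathcal{X} \to \mathbb{R}^d$ for $d = d(\epsilon, \delta)$ such that with probability at least 
$1 - \delta$, there exists a linear operator $\tilde{f}$ over $\mathbb{R}^d$ such that 

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right] \le \epsilon.$$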

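Proof. This result essentially allows us to project the learning problem into a Euclidean space where 
one can show, for the various learning problems considered here, that existing large margin 
techniques are applicable to solve the original problem. The result appeared in [8] and is presented here 
for completeness. 

Sample $d$ landmark points $\mathcal{L} = \{x_1, \ldots, x_d\}$ from $\mathcal{D}$ and construct the map 
$\Psi_{\mathcal{L}} : x \mapsto \frac{1}{\sqrt{d}}\left(K(x, x_1), \ldots, K(x, x_d)\right)$ and consider the linear operator 
$\tilde{f}$ over $\mathbb{R}^d$ defined as follows (in the following, we shall always omit the subscript 
$\mathcal{L}$ for clarity): 

$$\tilde{f} : x \mapsto \frac{1}{d}\sum_{i=1}^{d} w(x_i)K(x, x_i) = \langle \mathbf{w}, \Psi(x) \rangle 
\quad \text{for } \mathbf{w} = \frac{1}{\sqrt{d}}\left(w(x_1), \ldots, w(x_d)\right) \in \mathbb{R}^d.$$

A standard Hoeffding-style argument shows that for 
$d = O\left(\frac{B^2}{\epsilon^2}\log\frac{1}{\delta^2}\right) = O\left(\frac{B^2}{\epsilon^2}\log\frac{1}{\delta}\right)$, 
$\tilde{f}$ gives a pointwise approximation to $f$, i.e. for all $x \in \mathcal{X}$, with probability greater 
than $1 - \delta^2$, we have $\left|\tilde{f}(\Psi(x)) - f(x)\right| < \epsilon$. 

Now call the event $\text{BAD-APPROX}(x) := \left|\tilde{f}(\Psi(x)) - f(x)\right| > \epsilon$. Thus we have, for all 
$x \in \mathcal{X}$, 

$$\Pr_{\tilde{f}}\left[\text{BAD-APPROX}(x)\right] = \mathbb{E}_{\tilde{f}}\left[\mathbb{1}_{\text{BAD-APPROX}(x)}\right] < \delta^2$$

(here the probabilities are being taken over the construction of $\tilde{f}$, i.e. the choice of the landmark 
points). Taking expectations over the entire domain, applying Fubini's theorem to switch expectations 
and applying Markov's inequality we get 

$$\Pr_{\tilde{f}}\left[\Pr_{x \sim \mathcal{D}}\left[\text{BAD-APPROX}(x)\right] > \delta\right] < \delta.$$

Thus with confidence $1 - \delta$ we have $\Pr_{x \sim \mathcal{D}}\left[\text{BAD-APPROX}(x)\right] < \delta$ and thus 

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right] \le (1 - \delta)\epsilon + 2B\delta$$

since $\sup_x |\tilde{f}(\Psi(x))| \le B$ and $\sup_x |f(x)| \le B$. For $\delta \le \frac{\epsilon}{2B}$ we get 
$\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right] \le 2\epsilon$. 

□ 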
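
The landmarking construction can be checked numerically on a toy instance. The sketch below is an illustration under stated assumptions, not part of the proof: the domain is a finite grid (so that $f$ can be computed exactly), and the similarity `K`, the weight function `weight` and all other names are hypothetical.

```python
import math
import random

def K(x, z):
    # Hypothetical indefinite similarity with |K| <= 1 (tanh form).
    return math.tanh(2.0 * x * z - 0.5)

def weight(x, B=1.0):
    # Hypothetical bounded weight function w: X -> [-B, B].
    return B if x >= 0 else -B

domain = [i / 50.0 - 1.0 for i in range(101)]  # uniform distribution over a grid

def f_exact(x):
    """f(x) = E_{x'}[w(x') K(x, x')], computed exactly on the finite domain."""
    return sum(weight(z) * K(x, z) for z in domain) / len(domain)

def landmark_predictor(landmarks):
    """Return (w, psi, f_tilde) with the 1/sqrt(d) scaling on both w and psi."""
    d = len(landmarks)
    w = [weight(l) / math.sqrt(d) for l in landmarks]
    def psi(x):
        return [K(x, l) / math.sqrt(d) for l in landmarks]
    def f_tilde(x):
        return sum(wi * pi for wi, pi in zip(w, psi(x)))
    return w, psi, f_tilde

def mean_abs_error(d, rng):
    """E_x |f_tilde(Psi(x)) - f(x)| for d landmarks sampled i.i.d. from the domain."""
    landmarks = [rng.choice(domain) for _ in range(d)]
    _, _, f_tilde = landmark_predictor(landmarks)
    return sum(abs(f_tilde(x) - f_exact(x)) for x in domain) / len(domain)

rng = random.Random(0)
errors = {d: mean_abs_error(d, rng) for d in (10, 100, 1000)}
```

With the $1/\sqrt{d}$ scaling, `w` has Euclidean norm at most $B$ and `psi(x)` has norm at most 1, while their inner product is the plain average of `weight(l) * K(x, l)` over the landmarks.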



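Lemma 16 (Risk bounds for linear predictors [13]). Consider a real-valued prediction 
problem $y$ over a domain $\mathcal{X} = \{x : \|x\|_2 \le C_X\}$ and a linear learning model 
$\mathcal{F} : \{x \mapsto \langle w, x \rangle : \|w\|_2 \le C_W\}$ under some fixed loss function $\ell(\cdot, \cdot)$ that is 
$C_L$-Lipschitz in its second argument. For any $f \in \mathcal{F}$, let 
$\mathcal{L}_f = \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(f(x), y(x))\right]$ and $\hat{\mathcal{L}}^n_f$ be the empirical loss on a set 
of $n$ i.i.d. chosen points. Then we have, with probability greater than $(1 - \delta)$, 

$$\sup_{f \in \mathcal{F}}\left(\mathcal{L}_f - \hat{\mathcal{L}}^n_f\right) \le 3C_L C_X C_W\sqrt{\frac{\log(1/\delta)}{n}}.$$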

Proof. There exist a few results that provide a unified analysis for the generalization properties of 
linear predictors [13, 22]. However we use the heavy hammer of Rademacher average based analysis 
since it provides sharper bounds than covering number based analyses. 

The result follows from imposing a squared $L_2$ regularization on the $w$ vectors. Since the squared 
$L_2$ function is 2-strongly convex with respect to the $L_2$ norm, using [13, Theorem 1], we get a bound 
on the Rademacher complexity of the function class $\mathcal{F}$ as 
$\mathcal{R}_n(\mathcal{F}) \le C_X C_W\sqrt{\frac{1}{n}}$. Next, using the Lipschitz properties of the loss function, 
a result from [23] allows us to bound the excess error by 
$2C_L\mathcal{R}_n(\mathcal{F}) + C_L C_X C_W\sqrt{\frac{\log(1/\delta)}{2n}}$. The result then follows from simple 
manipulations. □ 

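Lemma 17 (Admissible weight functions for PSD kernels [9]). Consider a PSD kernel $K$ that 
is $(\epsilon_0, \gamma)$-good for a learning problem with respect to some convex loss function $\ell_K$. Then 
there exists a vector $W \in \mathcal{H}_K$ and a bounded weight function $w : \mathcal{X} \to \mathbb{R}$ such that 
$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_K\left(\langle W, \Phi_K(x) \rangle, y(x)\right)\right] \le \epsilon_0 + \frac{1}{2C\gamma^2}$ 
for some arbitrary positive constant $C$ and for all $x \in \mathcal{X}$, we have 
$\mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right] = \langle W, \Phi_K(x) \rangle$. 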



Proof. Note that the $(\epsilon_0, \gamma)$-goodness of $K$ guarantees the existence of a weight vector 
$W^* \in \mathcal{H}_K$ with small loss at large margin. Thus $W$ acts as a proxy for $W^*$ providing bounded loss at unit 
margin but with the additional property of being functionally equivalent to a bounded weighted 
average of the kernel values as required by the definition of a good similarity function. This will 
help us prove admissibility results for our similarity learning models. 

We start by proving the theorem for a discrete distribution; the generalization to non-discrete 
distributions will follow by using variational optimization techniques as discussed in [9]. Consider a 
discrete learning problem with $\mathcal{X} = \{x_1, \ldots, x_n\}$, corresponding distribution 
$\mathcal{D} = \{p_1, \ldots, p_n\}$ and target $y = \{y_1, \ldots, y_n\}$ such that $\sum_i p_i = 1$. Set up the 
following regularized ERM problem (albeit on the entire domain): 

$$\min_{W \in \mathcal{H}_K} \; \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + C\sum_{i=1}^{n} p_i\,\ell_K\left(\langle W, \Phi_K(x_i) \rangle, y_i\right)$$

Let $W$ be the weight vector corresponding to the optimum of the above problem. By the Representer 
Theorem (for example [24]), we can choose $W = \sum_{i=1}^{n}\alpha_i\Phi_K(x_i)$ for some bounded $\alpha_i$ 
(the exact bounds on $\alpha_i$ are problem specific). By $(\epsilon_0, \gamma)$-goodness of $K$ (with $W^*$ of unit norm) we have 

$$\frac{1}{2}\|W\|^2_{\mathcal{H}_K} + C\sum_{i=1}^{n} p_i\,\ell_K\left(\langle W, \Phi_K(x_i) \rangle, y_i\right) 
\le \frac{1}{2}\left\|\frac{W^*}{\gamma}\right\|^2_{\mathcal{H}_K} + C\sum_{i=1}^{n} p_i\,\ell_K\left(\frac{\langle W^*, \Phi_K(x_i) \rangle}{\gamma}, y_i\right) 
= \frac{1}{2\gamma^2} + C \cdot \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_K\left(\frac{\langle W^*, \Phi_K(x) \rangle}{\gamma}, y(x)\right)\right] 
\le \frac{1}{2\gamma^2} + C\epsilon_0.$$

Thus we have 

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_K\left(\langle W, \Phi_K(x) \rangle, y(x)\right)\right] 
\le \frac{1}{2C}\|W\|^2_{\mathcal{H}_K} + \sum_{i=1}^{n} p_i\,\ell_K\left(\langle W, \Phi_K(x_i) \rangle, y_i\right) 
\le \epsilon_0 + \frac{1}{2C\gamma^2}$$

which proves the first part of the claim. For the second part, set up a weight function 
$w_i = \frac{\alpha_i}{p_i}$. Then, for any $x \in \mathcal{X}$ we have 

$$\mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right] = \sum_{i=1}^{n} p_i w_i K(x, x_i) 
= \sum_{i=1}^{n}\alpha_i\langle \Phi_K(x_i), \Phi_K(x) \rangle = \langle W, \Phi_K(x) \rangle.$$

The weight function is bounded since the $\alpha_i$ are bounded and, this being a discrete learning problem, 
cannot have vanishing probability masses $p_i$ (actually, in the cases we shall consider, the $\alpha_i$ will 
itself contain a $p_i$ term that will subsequently get cancelled). For non-discrete cases, variational 
techniques give us similar results. □ 



Appendix B Justifying Double-dipping 

All our analyses (as well as the analyses presented in [6, 7, 8]) use some data as landmark points 
and then require a fresh batch of training points to learn a classifier on the landmarked space. In 
practice, however, it might be useful to reuse training data to act as landmark points as well. This 
is especially true of [8], which requires labeled landmarks. We give below generalization bounds 
for similarity-based learning algorithms that indulge in such "double dipping". The argument uses 
a technique outlined in [11] and falls within the Rademacher-average based uniform convergence 
guarantees used elsewhere in the paper. We present a generic argument that, in a manner similar to 
Lemma 16, can be specialized to the various learning problems considered in this paper. 



To make the presentation easier we set up some notation. For any predictor $f$, let 
$\mathcal{L}_f := \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(f(x), y(x))\right]$ and for any training set $S$ of size $n$, let 
$\hat{\mathcal{L}}^S_f := \frac{1}{n}\sum_{x_i \in S}\ell(f(x_i), y(x_i))$. 
For any landmark set $S = (x_1, \ldots, x_n)$, we let $\Psi_S : x \mapsto \left(K(x, x_1), \ldots, K(x, x_n)\right)$. 
For any weight vector $w \in \mathbb{R}^n$, $\|w\|_\infty \le B$ in the landmarked space, denote the predictor 
$f_{(S,w)} := \frac{1}{n}\langle w, \Psi_S(x) \rangle = \frac{1}{n}\sum_{i=1}^{n} w_i K(x, x_i)$. Also let 
$\mathcal{F}_S := \left\{x \mapsto \frac{1}{n}\langle w, \Psi_S(x) \rangle\right\} = \left\{f_{(S,w)} : w \in \mathbb{R}^n, \|w\|_\infty \le B\right\}$. 

We note that the embedding defined above is "stable" in the sense that changing a single landmark 
does not change the embedding too much with respect to bounded predictors. More formally, for 
any set of $n$ points $S = (x_1, \ldots, x_n)$, define 
$g(S) := \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\}$. Let $S^i$ be another set of $n$ 
points that (arbitrarily) differs from $S$ just at the $i$-th point (which we denote $x'_i$) and coincides with $S$ 
on the rest. Then we have, for any fixed $w$ of bounded norm (i.e. $\|w\|_\infty \le B$) and bounded similarity 
function (i.e. $|K(x, x')| \le 1$), 

$$\sup_x\left\{\left|f_{(S,w)}(x) - f_{(S^i,w)}(x)\right|\right\} 
= \sup_x\left\{\left|\frac{1}{n}\sum_{j=1}^{n} w_j K(x, x_j) - \frac{1}{n}\sum_{j \ne i} w_j K(x, x_j) - \frac{1}{n}w_i K(x, x'_i)\right|\right\} 
= \sup_x\left\{\left|\frac{1}{n}w_i\left(K(x, x_i) - K(x, x'_i)\right)\right|\right\} 
\le \frac{2B}{n}.$$
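
The stability claim is easy to verify numerically. The following sketch swaps a single landmark and measures the largest change in the predictor over a grid; the similarity `K`, the sampling ranges and the index of the swapped landmark are arbitrary illustrative choices, not quantities from the analysis.

```python
import math
import random

def K(x, z):
    # Hypothetical bounded similarity, |K| < 1.
    return math.tanh(x * z - 0.25)

def predictor(landmarks, w):
    """f_(S,w)(x) = (1/n) * sum_i w_i K(x, x_i)."""
    n = len(landmarks)
    return lambda x: sum(wi * K(x, li) for wi, li in zip(w, landmarks)) / n

rng = random.Random(1)
n, B = 50, 2.0
S = [rng.uniform(-3, 3) for _ in range(n)]
w = [rng.uniform(-B, B) for _ in range(n)]   # sup-norm of w is at most B

S_i = list(S)
S_i[7] = rng.uniform(-3, 3)                  # S^i differs from S at one landmark only

f_S, f_Si = predictor(S, w), predictor(S_i, w)
grid = [k / 20.0 - 5.0 for k in range(201)]
max_gap = max(abs(f_S(x) - f_Si(x)) for x in grid)
```

The observed `max_gap` never exceeds `2 * B / n`, in line with the bound above.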



Note that, although [8] uses pairs of labeled points to define the embedding, the following argument 
can easily be extended to incorporate this since the embedding is identical to the embedding $\Psi_S$ 
described above with respect to being "stable". In fact this analysis holds for any stable embedding 
defined using training points. 

Our argument proceeds by showing that with high probability (over choice of the set $S$) we have 

$$\sup_{w}\left\{\mathcal{L}_{f_{(S,w)}} - \hat{\mathcal{L}}^S_{f_{(S,w)}}\right\} \le \epsilon.$$

By the definition of $\mathcal{F}_S$, the above requirement translates to showing that with high probability, 

$$\sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\} \le \epsilon,$$

which highlights the fact that we are dealing with a problem of sample dependent hypothesis spaces.^1 
Note that this exactly captures the double dipping procedure of reusing training points as landmark 
points. Such a result would be useful as follows: using Lemma 15 and task specific guarantees 
(outlined in detail in the subsequent sections), we have, with high probability, the existence of a 
good predictor in the landmarked space of a randomly chosen landmark set $S$, i.e. with very high 
probability over choice of $S$, we have $\inf_{f \in \mathcal{F}_S}\{\mathcal{L}_f\} \le \epsilon_0$. Let this be achieved by 
the predictor $f^*$. Using the uniform convergence guarantee above we get 
$\hat{\mathcal{L}}^S_{f^*} \le \epsilon_0 + \epsilon$ (with some loss of confidence due to application of a union bound). 

Now consider the predictor $\hat{f} := \arg\inf_{f \in \mathcal{F}_S}\left\{\hat{\mathcal{L}}^S_f\right\}$. Clearly 
$\hat{\mathcal{L}}^S_{\hat{f}} \le \hat{\mathcal{L}}^S_{f^*} \le \epsilon_0 + \epsilon$. Invoking the uniform 
convergence bound yet again shows us that 

$$\mathcal{L}_{\hat{f}} \le \hat{\mathcal{L}}^S_{\hat{f}} + \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\} \le \epsilon_0 + 2\epsilon.$$

Note that we incur some more loss of confidence due to another application of the union bound. 
This tells us that with high probability, a predictor learned by choosing a random landmark set and 
training on the landmark set itself would yield a good predictor. 

We will proceed via a vanilla uniform convergence argument involving symmetrization and an 
application of McDiarmid's inequality (stated below). However, proving the stability prerequisite for 
the application of McDiarmid's inequality shall require use of the stability of both the predictor 
$f_{(S,w)}$ as well as the embedding $\Psi_S$. Let the loss function $\ell$ be $C_L$-Lipschitz in its first argument. 

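Theorem 18 (McDiarmid's inequality [25]). Let $X_1, \ldots, X_n$ be independent random variables 
taking values in some set $\mathcal{X}$. Further, let $f : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables that 
satisfies, for all $i \in [n]$ and all $x_1, \ldots, x_n, x'_i \in \mathcal{X}$, 

$$\left|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x'_i, \ldots, x_n)\right| \le c_i,$$

then for all $\epsilon > 0$, we have 

$$\Pr\left[f - \mathbb{E}[f] > \epsilon\right] \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right).$$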

^1 We were not able to find any written manuscript detailing the argument of [11]. However the argument 
itself is fairly generic in allowing one to prove generalization bounds for sample dependent hypothesis spaces. 

We shall invoke McDiarmid's inequality on the function 
$g(S) := \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\}$ with $S = (x_1, \ldots, x_n)$ being the 
random variables in question. To do so we first prove the stability of the 
function $g(S)$ with respect to its variables and then bound the value of $\mathbb{E}_S\left[g(S)\right]$. 

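Theorem 19. For any $S$, $S^i$ we have $\left|g(S) - g(S^i)\right| \le \frac{6BC_L}{n}$. 

Proof. We have 

$$g(S) = \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\} 
= \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^{S^i}_f + \hat{\mathcal{L}}^{S^i}_f - \hat{\mathcal{L}}^S_f\right\} 
\le \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^{S^i}_f\right\} + \sup_{f \in \mathcal{F}_S}\left\{\hat{\mathcal{L}}^{S^i}_f - \hat{\mathcal{L}}^S_f\right\} 
\le \sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^{S^i}_f\right\} + \frac{2BC_L}{n}$$

where in the fourth step we have used the fact that the loss function is Lipschitz and the embedding 
function $\Psi_S$ is bounded. We also have 

$$\sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^{S^i}_f\right\} 
= \sup_{w}\left\{\mathcal{L}_{f_{(S,w)}} - \hat{\mathcal{L}}^{S^i}_{f_{(S,w)}}\right\} 
= \sup_{w}\left\{\mathcal{L}_{f_{(S,w)}} - \mathcal{L}_{f_{(S^i,w)}} + \mathcal{L}_{f_{(S^i,w)}} - \hat{\mathcal{L}}^{S^i}_{f_{(S^i,w)}} + \hat{\mathcal{L}}^{S^i}_{f_{(S^i,w)}} - \hat{\mathcal{L}}^{S^i}_{f_{(S,w)}}\right\}$$
$$\le \sup_{w}\left\{\mathcal{L}_{f_{(S,w)}} - \mathcal{L}_{f_{(S^i,w)}}\right\} + \sup_{f \in \mathcal{F}_{S^i}}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^{S^i}_f\right\} + \sup_{w}\left\{\hat{\mathcal{L}}^{S^i}_{f_{(S^i,w)}} - \hat{\mathcal{L}}^{S^i}_{f_{(S,w)}}\right\} 
\le \frac{2BC_L}{n} + \sup_{f \in \mathcal{F}_{S^i}}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^{S^i}_f\right\} + \frac{2BC_L}{n} 
= g(S^i) + \frac{4BC_L}{n}$$

where in the fourth step we have used the stability of the embedding function and 
that the loss function is $C_L$-Lipschitz in its first argument so that for all $x$ we have 
$\left|\ell\left(f_{(S,w)}(x), y(x)\right) - \ell\left(f_{(S^i,w)}(x), y(x)\right)\right| \le \frac{2BC_L}{n}$, which holds in expectation 
over any (empirical) distribution as well. Putting the two inequalities together gives us 
$g(S) \le g(S^i) + \frac{6BC_L}{n}$. Similarly we also have $g(S^i) \le g(S) + \frac{6BC_L}{n}$ which gives us the result. 

□ 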



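We now have that the function $g(S)$ is $O\left(\frac{BC_L}{n}\right)$-stable with respect to each of its inputs. 
We now move on to bound its expectation. For any function class $\mathcal{F}$ we define its empirical Rademacher 
average as follows 

$$\mathcal{R}_n(\mathcal{F}) := \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}}\left\{\frac{1}{n}\sum_{x_i \in S}\sigma_i f(x_i)\right\}\right]$$

Also let $\mathcal{F} := \{x \mapsto \langle w, x \rangle : \|w\|_2 \le B\}$ and $\mathcal{X} := \{x : \|x\|_2 \le 1\}$. 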



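Theorem 20. $\mathbb{E}_S\left[\sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\}\right] \le 2BC_L\sqrt{\frac{1}{n}}$. 

Proof. We have 

$$\mathbb{E}_S\left[\sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\}\right] 
= \mathbb{E}_S\left[\sup_{f \in \mathcal{F}_S}\left\{\mathbb{E}_{S'}\left[\hat{\mathcal{L}}^{S'}_f\right] - \hat{\mathcal{L}}^S_f\right\}\right]$$
$$\le \mathbb{E}_{S,S'}\left[\sup_{f \in \mathcal{F}_S}\left\{\hat{\mathcal{L}}^{S'}_f - \hat{\mathcal{L}}^S_f\right\}\right]$$
$$\le \mathbb{E}_{S,S'}\left[\sup_{f \in \mathcal{F}_{S \cup S'}}\left\{\hat{\mathcal{L}}^{S'}_f - \hat{\mathcal{L}}^S_f\right\}\right]$$
$$= \mathbb{E}_{S,S',\sigma}\left[\sup_{f \in \mathcal{F}_{S \cup S'}}\left\{\frac{1}{n}\sum_{i=1}^{n}\sigma_i\left(\ell(f(x'_i), y(x'_i)) - \ell(f(x_i), y(x_i))\right)\right\}\right]$$
$$\le 2\,\mathbb{E}_{S,S',\sigma}\left[\sup_{f \in \mathcal{F}_{S \cup S'}}\left\{\frac{1}{n}\sum_{x_i \in S}\sigma_i\,\ell(f(x_i), y(x_i))\right\}\right]$$
$$= 2\,\mathbb{E}_{S,S',\sigma}\left[\sup_{w}\left\{\frac{1}{n}\sum_{x_i \in S}\sigma_i\,\ell\left(f_{(S \cup S',w)}(x_i), y(x_i)\right)\right\}\right]$$
$$\le 2\,\mathbb{E}_{S,\sigma}\left[\sup_{f \in \mathcal{F}}\left\{\frac{1}{n}\sum_{x_i \in S}\sigma_i\,\ell(f(x_i), y(x_i))\right\}\right] 
= 2\,\mathbb{E}_S\left[\mathcal{R}_n(\ell \circ \mathcal{F})\right] 
\le 2C_L\,\mathbb{E}_S\left[\mathcal{R}_n(\mathcal{F})\right] 
\le 2BC_L\sqrt{\frac{1}{n}}$$

where in the third step we have used the fact that $\mathcal{F}_S \supseteq \mathcal{F}_{S'}$ if $S \supseteq S'$ (this is the 
monotonicity requirement in [11]). Note that this is essential to introduce symmetry so that Rademacher variables 
can be introduced in the next (symmetrization) step. In the seventh step, we have used the fact that 
for every $S$ such that $|S| = n$ and $w \in \mathbb{R}^n$ such that $\|w\|_\infty \le B$, there exists a function 
$f \in \mathcal{F}$ such that for all $x$, there exists an $x' \in \mathcal{X}$ such that $f_{(S,w)}(x) = f(x')$. In the last 
step we have used a result from [26] which allows calculation of Rademacher averages for composition classes 
and an intermediate result from the proof of Lemma 16 which gives us Rademacher averages for the 
function class $\mathcal{F}$. □ 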

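Thus, by an application of McDiarmid's inequality (with $c_i = \frac{6BC_L}{n}$ from Theorem 19) we have, 
with probability $(1 - \delta)$ over choice of the landmark (training) set, 

$$\sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\} 
\le \mathbb{E}_S\left[\sup_{f \in \mathcal{F}_S}\left\{\mathcal{L}_f - \hat{\mathcal{L}}^S_f\right\}\right] + 6BC_L\sqrt{\frac{\log(1/\delta)}{2n}} 
\le 2BC_L\sqrt{\frac{1}{n}} + 6BC_L\sqrt{\frac{\log(1/\delta)}{2n}},$$

which concludes our argument justifying double dipping. 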

Appendix C Regression with Similarity Functions 

In this section we give proofs of utility and admissibility results for our similarity based learning 
model for real-valued regression tasks. 

C.1 Proof of Theorem 5 

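First of all, we use Lemma 15 to project onto a $d$-dimensional space where there exists a linear 
predictor $\tilde{f} : x \mapsto \langle w, x \rangle$ such that 
$\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right] \le 2\epsilon_1$. Note that $\|w\|_2 \le B$ and 
$\sup_x\{\|\Psi(x)\|\} \le 1$ by construction. We will now show that $\tilde{f}$ has bounded $\epsilon$-insensitive loss. 

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon\left(\tilde{f}(\Psi(x)), y(x)\right)\right] 
= \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon\left(f(x), y(x)\right)\right] + \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon\left(\tilde{f}(\Psi(x)), y(x)\right) - \ell_\epsilon\left(f(x), y(x)\right)\right]$$
$$\le \epsilon_0 + \mathbb{E}_{x \sim \mathcal{D}}\left[\left|\ell_\epsilon\left(\tilde{f}(\Psi(x)), y(x)\right) - \ell_\epsilon\left(f(x), y(x)\right)\right|\right]$$
$$\le \epsilon_0 + \mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right]$$
$$\le \epsilon_0 + 2\epsilon_1$$

where in the second step we have used the goodness properties of $K$, and in the third step we used 
the fact that the $\epsilon$-insensitive loss function is 1-Lipschitz in its first argument. Note that 
$\|w\|_2 \approx \sqrt{\mathbb{E}\left[w^2(x)\right]}$ with high probability and if $\sqrt{\mathbb{E}\left[w^2(x)\right]} \ll B$ then we get 
a much better bound on the norm of $w$. The excess loss incurred due to this landmarking step is, with 
probability $1 - \delta$, at most $32B\sqrt{\frac{\log(1/\delta)}{d}}$. 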

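Now consider the following regularized ERM problem on $n$ i.i.d. sample points: 

$$\hat{w} = \arg\min_{w : \|w\|_2 \le B} \; \frac{1}{n}\sum_{i=1}^{n}\ell_\epsilon\left(\langle w, \Psi(x_i) \rangle, y(x_i)\right)$$

The final output of our learning algorithm shall be $x \mapsto \langle \hat{w}, \Psi(x) \rangle$. Here we have 
$C_X = 1$, $C_L = 1$ since $\ell_\epsilon(\cdot)$ is 1-Lipschitz, and $C_W = B$. Thus by Lemma 16 we get that the excess 
loss incurred due to this regularized ERM step is at most $3B\sqrt{\frac{\log(1/\delta)}{n}}$. 

Since the $\epsilon$-insensitive loss is related to the absolute error by $|x| \le \ell_\epsilon(x) + \epsilon$ we have the 
total error (with respect to absolute loss) being incurred by our predictor to be, with probability at 
least $1 - 2\delta$, at most 

$$\epsilon_0 + 32B\sqrt{\frac{\log(1/\delta)}{d}} + 3B\sqrt{\frac{\log(1/\delta)}{n}} + \epsilon.$$

Taking $d = O\left(\frac{B^2}{\epsilon_1^2}\log\frac{1}{\delta}\right)$ unlabeled landmarks and 
$n = O\left(\frac{B^2}{\epsilon_1^2}\log\frac{1}{\delta}\right)$ labeled training points gives us our desired result. 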

C.2 Proof of Theorem 6 



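We prove the two parts of the result separately. 

Part 1: Admissibility: Using Lemma 17 it is possible to obtain a vector 
$W = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)\Phi_K(x_i) \in \mathcal{H}_K$ with small loss such that 
$0 \le \alpha_i, \alpha_i^* \le p_i C$ and $\alpha_i\alpha_i^* = 0$ (these inequalities 
are a consequence of applying the KKT conditions). This allows us to construct a weight function 
$w_i = \frac{\alpha_i - \alpha_i^*}{p_i}$ such that $|w_i| \le C$ and 
$\mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right] = \langle W, \Phi_K(x) \rangle$ for all $x \in \mathcal{X}$. 
Thus we have 

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon\left(\mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right], y(x)\right)\right] 
= \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon\left(\langle W, \Phi_K(x) \rangle, y(x)\right)\right] 
\le \epsilon_0 + \frac{1}{2C\gamma^2}.$$

Setting $C = \frac{1}{2\epsilon_1\gamma^2}$ gives us our result. 

We can use variational techniques to extend this to non-discrete distributions as well. 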

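Part 2: Tightness: The tight example that we provide is an adaptation of the example given for 
large margin classification in [9]. However, our analysis differs from that of [9], partly necessitated 
by our choice of loss function. 

Consider the following regression problem: $\mathcal{X} = \{x_1, x_2, x_3, x_4\} \subset \mathbb{R}^3$, 
$\mathcal{D} = \left\{\frac{1}{2} - \epsilon, \epsilon, \epsilon, \frac{1}{2} - \epsilon\right\}$, $y = \{+1, +1, -1, -1\}$, with 

$$x_1 = \left(\gamma, \gamma, \sqrt{1 - 2\gamma^2}\right), \quad 
x_2 = \left(\gamma, -\gamma, \sqrt{1 - 2\gamma^2}\right), \quad 
x_3 = \left(-\gamma, \gamma, \sqrt{1 - 2\gamma^2}\right), \quad 
x_4 = \left(-\gamma, -\gamma, \sqrt{1 - 2\gamma^2}\right).$$

Clearly the vector $w = (1, 0, 0)$ yields a predictor $y'$ with no $\epsilon$-insensitive loss for $\epsilon = 0$ (i.e. 
$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_0\left(y(x) - y'(x)\right)\right] = 0$) at margin $\gamma$. Thus the native inner product 
$\langle \cdot, \cdot \rangle$ on $\mathbb{R}^3$ is a $(0, \gamma)$-good kernel for this particular regression problem. 

Now consider any bounded weighing function on $\mathcal{X}$, $w = \{w_1, w_2, w_3, w_4\}$, and analyze the 
effectiveness of $\langle \cdot, \cdot \rangle$ as a similarity function. The output $\hat{y}$ of the resulting predictor on 
the different points is given by $\hat{y}_i = \sum_{j=1}^{4} p_j w_j\langle x_j, x_i \rangle$. 

In particular, consider the output on the heavy points $x_1$ and $x_4$ (note that the analysis in [9] considers 
the light points $x_2$ and $x_3$ instead). We have 

$$\hat{y}_1 = \left(\tfrac{1}{2} - \epsilon\right)w_1 + \epsilon\left(1 - 2\gamma^2\right)(w_2 + w_3) + \left(\tfrac{1}{2} - \epsilon\right)w_4\left(1 - 4\gamma^2\right) 
= a + \left(\tfrac{1}{2} - \epsilon\right)(w_1 + bw_4)$$
$$\hat{y}_4 = \left(\tfrac{1}{2} - \epsilon\right)w_1\left(1 - 4\gamma^2\right) + \epsilon\left(1 - 2\gamma^2\right)(w_2 + w_3) + \left(\tfrac{1}{2} - \epsilon\right)w_4 
= a + \left(\tfrac{1}{2} - \epsilon\right)(bw_1 + w_4)$$

for $a = \epsilon\left(1 - 2\gamma^2\right)(w_2 + w_3)$, $b = \left(1 - 4\gamma^2\right)$. The main idea behind this choice is that the 
difference in the value of the predictor on these points is only due to the values of $w_1$ and $w_4$. Since 
the true values at these points are very different, this should force $w_1$ and $w_4$ to take large values 
unless a large error is incurred. To formalize this argument we lower bound the expected $\ell_0(\cdot)$ loss 
of this predictor by the loss incurred on these heavy points. 

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_0\left(y(x) - \hat{y}(x)\right)\right] 
\ge \left(\tfrac{1}{2} - \epsilon\right)\left(\ell_0\left(y(x_1) - \hat{y}(x_1)\right) + \ell_0\left(y(x_4) - \hat{y}(x_4)\right)\right)$$
$$= \left(\tfrac{1}{2} - \epsilon\right)\left(\left|1 - \hat{y}(x_1)\right| + \left|-1 - \hat{y}(x_4)\right|\right)$$
$$\ge \left(\tfrac{1}{2} - \epsilon\right)\left(2 - \hat{y}(x_1) + \hat{y}(x_4)\right)$$
$$= \left(\tfrac{1}{2} - \epsilon\right)\left(2 - \left(\tfrac{1}{2} - \epsilon\right)(1 - b)(w_1 - w_4)\right) 
= \left(\tfrac{1}{2} - \epsilon\right)\left(2 - \left(\tfrac{1}{2} - \epsilon\right)\left(4\gamma^2\right)(w_1 - w_4)\right)$$

where in the second step we use the fact that $\ell_0(x) = |x|$ and in the third step we used the fact that 
$|a| + |b| \ge a - b$. Thus, in order to have expected error at most $\epsilon_1$, we require 

$$w_1 - w_4 \ge \frac{1}{4\gamma^2} \cdot \frac{1}{\frac{1}{2} - \epsilon}\left(2 - \frac{\epsilon_1}{\frac{1}{2} - \epsilon}\right) = \frac{1}{4\epsilon_1\gamma^2}$$

for the setting $\epsilon = \frac{1}{2} - \epsilon_1$. Thus we have $|w_1| + |w_4| \ge w_1 - w_4 \ge \frac{1}{4\epsilon_1\gamma^2}$ 
which implies $\max\left(|w_1|, |w_4|\right) \ge \frac{1}{8\epsilon_1\gamma^2}$, which proves the result. 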
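
The tight instance is small enough to be checked directly. The sketch below builds the four points, confirms the inner products used in the analysis, and compares the absolute loss of randomly drawn bounded weight functions against the lower bound implied by the argument above when every $|w_i| \le B$, namely $\left(\frac{1}{2} - \epsilon\right)\left(2 - 8\left(\frac{1}{2} - \epsilon\right)\gamma^2 B\right)$. The concrete values of `gamma`, `eps` and `B` and all helper names are illustrative assumptions.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tight_instance(gamma, eps):
    """Four points in R^3 with probabilities (1/2 - eps, eps, eps, 1/2 - eps)."""
    s = math.sqrt(1.0 - 2.0 * gamma ** 2)
    points = [(gamma, gamma, s), (gamma, -gamma, s), (-gamma, gamma, s), (-gamma, -gamma, s)]
    probs = [0.5 - eps, eps, eps, 0.5 - eps]
    labels = [1.0, 1.0, -1.0, -1.0]
    return points, probs, labels

def absolute_loss(points, probs, labels, weights):
    """Expected |y - y_hat| for y_hat(x_i) = sum_j p_j w_j <x_j, x_i>."""
    loss = 0.0
    for xi, pi, yi in zip(points, probs, labels):
        y_hat = sum(pj * wj * dot(xj, xi) for xj, pj, wj in zip(points, probs, weights))
        loss += pi * abs(yi - y_hat)
    return loss

gamma, eps, B = 0.1, 0.4, 10.0
points, probs, labels = tight_instance(gamma, eps)
heavy = 0.5 - eps                                   # mass of each heavy point x_1, x_4
lower_bound = heavy * (2.0 - 8.0 * heavy * gamma ** 2 * B)

rng = random.Random(0)
losses = [absolute_loss(points, probs, labels, [rng.uniform(-B, B) for _ in range(4)])
          for _ in range(1000)]
```

No weight function with entries in $[-B, B]$ gets below `lower_bound`, which illustrates why small loss forces large weights on the heavy points.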



Appendix D Sparse Regression with Similarity functions 

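Our utility proof proceeds in three steps. In the first step we project our learning problem, via the 
landmarking step given in Step 1 of Algorithm 1, to a linear landmarked space and show that the 
landmarked space admits a sparse linear predictor with bounded $\epsilon$-insensitive loss. This is 
formalized in Theorem 8 which we restate for convenience. 

Algorithm 2 Sparse regression [10] 

Input: A $\beta$-smooth loss function $\ell(\cdot, \cdot)$, regularization parameter $C_W$ used in Equation 2, error tolerance $\epsilon$ 
Output: A sparse predictor $\hat{w}$ with bounded loss 

1: $k \leftarrow \left\lceil \frac{8\beta C_W^2}{\epsilon} \right\rceil$, $w^{(0)} = 0$ 
2: for $t = 0$ to $k - 1$ do 
3:   $\theta^{(t)} \leftarrow \nabla_w\mathcal{R}\left(w^{(t)}\right) = \mathbb{E}_{x \sim \mathcal{D}}\left[\frac{\partial}{\partial w}\ell\left(\langle w^{(t)}, x \rangle, y(x)\right)\right]$ 
4:   $r_t = \arg\max_{j \in [d]}\left|\theta^{(t)}_j\right|$ 
5:   $\delta_t \leftarrow \left\langle \theta^{(t)}, w^{(t)} \right\rangle + C_W\left\|\theta^{(t)}\right\|_\infty$ 
6:   $\eta_t = \min\left\{1, \frac{\delta_t}{4C_W^2\beta}\right\}$ 
7:   $w^{(t+1)} \leftarrow (1 - \eta_t)w^{(t)} + \eta_t\,\mathrm{sign}\left(-\theta^{(t)}_{r_t}\right)C_W e^{r_t}$ 
8:   if $\delta_t \le \epsilon$ then 
9:     return $w^{(t)}$ 
10:  end if 
11: end for 
12: return $w^{(k)}$ 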
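
For readers who prefer code, the steps of Algorithm 2 can be sketched on an empirical distribution as follows. This is an illustrative sketch rather than our implementation: the iteration count `ceil(8 * beta * C_W**2 / eps)` and the step-size denominator `4 * C_W**2 * beta` follow our reading of [10] and should be treated as assumptions, features are assumed to satisfy $\|x\|_\infty \le 1$, and the half squared loss (which is 1-smooth) and the toy data are chosen only for the example.

```python
import math

def forward_greedy_selection(X, y, loss_grad, beta, C_W, eps):
    """Greedy coordinate selection over the L1 ball of radius C_W.

    loss_grad(p, t) is the derivative of the loss in its first argument.
    """
    n, d = len(X), len(X[0])
    k = math.ceil(8 * beta * C_W ** 2 / eps)   # assumed iteration budget
    w = [0.0] * d
    for _ in range(k):
        preds = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
        # theta: gradient of the empirical risk at the current w
        theta = [sum(loss_grad(p, t) * x[j] for p, t, x in zip(preds, y, X)) / n
                 for j in range(d)]
        r = max(range(d), key=lambda j: abs(theta[j]))          # best coordinate
        gap = sum(tj * wj for tj, wj in zip(theta, w)) + C_W * abs(theta[r])
        if gap <= eps:                                           # stopping condition
            return w
        eta = min(1.0, gap / (4 * C_W ** 2 * beta))
        sign = -1.0 if theta[r] > 0 else 1.0
        w = [(1 - eta) * wj for wj in w]                         # shrink old weights
        w[r] += eta * sign * C_W                                 # move toward a vertex
    return w

def half_squared_grad(p, t):
    return p - t   # derivative of (p - t)^2 / 2, a 1-smooth loss

def risk(w, X, y):
    return sum((sum(wj * xj for wj, xj in zip(w, x)) - t) ** 2 / 2
               for x, t in zip(X, y)) / len(y)

X = [[1.0, 0.0, 0.5], [-1.0, 0.5, 0.0], [0.5, 1.0, -1.0], [-0.5, -1.0, 1.0]]
y = [1.0, -1.0, 0.5, -0.5]            # generated by the sparse vector (1, 0, 0)
w_hat = forward_greedy_selection(X, y, half_squared_grad, beta=1.0, C_W=1.0, eps=0.05)
```

Each iteration touches at most one new coordinate and forms a convex combination with a vertex of the $L_1$ ball, so the output has $L_1$ norm at most `C_W` and at most as many non-zero entries as iterations run.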

Theorem 21 (Theorem 8 restated). Given a similarity function that is $(\epsilon_0, B, r)$-good for a 
regression problem, there exists a randomized map $\Psi : \mathcal{X} \to \mathbb{R}^d$ for 
$d = O\left(\frac{B^2}{r\epsilon_1^2}\log\frac{1}{\delta}\right)$ such that 
with probability at least $1 - \delta$, there exists a linear operator $\tilde{f} : x \mapsto \langle w, x \rangle$ over 
$\mathbb{R}^d$ such that $\|w\|_1 \le B$ with $\epsilon$-insensitive loss bounded by $\epsilon_0 + \epsilon_1$. Moreover, with the 
same confidence we have $\|w\|_0 \le \frac{3dr}{2}$. 



Proof. The proof of this theorem essentially parallels that of [15, Theorem 8] but diverges later since 
the aim there is to preserve margin violations whereas we wish to preserve loss under the absolute 
loss function. Sample $d$ landmark points $\mathcal{L} = \{x_1, \ldots, x_d\}$ from the distribution $\mathcal{D}$ and 
construct the map $\Psi_{\mathcal{L}} : x \mapsto \left(K(x, x_1), \ldots, K(x, x_d)\right)$ and consider the linear operator 
$\tilde{f} : x \mapsto \langle w, x \rangle$ with $w_i = \frac{w(x_i)R(x_i)}{d_{\text{inf}}}$ where 
$d_{\text{inf}} = \sum_{i=1}^{d} R(x_i)$ is the number of informative landmarks. In the 
following we will refer to $\tilde{f}$ and $w$ interchangeably. This ensures that 
$\|\tilde{f}\|_1 := \|w\|_1 \le B$. Note 
that we have chosen an $L_1$ normalized weight vector instead of an $L_2$ normalized one like we had 
in Lemma 15. This is due to a subsequent use of sparsity promoting regularizers whose analysis 
requires the existence of bounded $L_1$ norm predictors. 

Using the arguments given for Lemma 15 and Theorem 5 we can show that if 
$d_{\text{inf}} = \Omega\left(\frac{B^2}{\epsilon_1^2}\log\frac{1}{\delta}\right)$ (i.e. if we have collected enough informative 
landmarks), then we are done. However, the Chernoff bound (lower tail) tells us that for 
$d = \Omega\left(\frac{B^2}{r\epsilon_1^2}\log\frac{1}{\delta}\right)$, this will happen with probability $1 - \delta$. 
Moreover, the Chernoff bound (upper tail) tells us that, simultaneously, we will also have 
$d_{\text{inf}} \le \frac{3dr}{2}$. Together these prove the claim. 

□ 



Note that the number of informative landmarks required is, up to constant factors, the same as the 
number required in Theorem 5. However, we see that in order to get this many informative 
landmarks, we have to sample a much larger number of landmarks. In the following, we shall 
see how to extract a sparse predictor in the landmarked space with good generalization properties. 
The following analysis shall assume the existence of a good predictor on the landmarked space 
and hence all subsequent results shall be conditioned on the guarantees given by Theorem 8. 

D.1 Learning sparse predictors in the landmarked space 



We use the Forward Greedy Selection algorithm presented in [10] to extract a sparse predictor in the 
landmarked space. The algorithm is presented in pseudocode form in Algorithm 2. The algorithm 
can be seen as a (modified) form of orthogonal matching pursuit wherein at each step we add a 
coordinate to the support of the weight vector. The coordinate is added in a greedy manner so as 
to provide maximum incremental benefit in terms of lowering the loss. Thus the sparsity of the 
resulting predictor is bounded by the number of steps for which this algorithm is allowed to run. 
The algorithm requires that it be used with a smooth loss function. A loss function 
$\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ is said to be $\beta$-smooth if, for all $y, a, b \in \mathbb{R}$, we have 

$$\ell(a, y) - \ell(b, y) \le \left.\frac{\partial\ell(x, y)}{\partial x}\right|_{x = b}(a - b) + \frac{\beta(a - b)^2}{2}.$$



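Unfortunately, this excludes the $\epsilon$-insensitive loss. However it is possible to run the algorithm with 
a smooth surrogate whose loss can be transferred to $\epsilon$-insensitive loss. Following [10], we choose 
the following loss function: 

$$\tilde{\ell}_\beta(a, b) = \inf_{v \in \mathbb{R}}\left[\frac{\beta}{2}v^2 + \ell_\epsilon(a - v, b)\right]$$

One can, by a mildly tedious case-by-case analysis, arrive at an explicit form for this loss function 

$$\tilde{\ell}_\beta(a, b) = \begin{cases} 
0 & |a - b| \le \epsilon \\ 
\frac{\beta}{2}\left(|a - b| - \epsilon\right)^2 & \epsilon < |a - b| < \epsilon + \frac{1}{\beta} \\ 
|a - b| - \epsilon - \frac{1}{2\beta} & |a - b| \ge \epsilon + \frac{1}{\beta} 
\end{cases}$$

Note that this loss function is convex as well as differentiable (actually $\beta$-smooth) which will be 
crucial in the following analysis. Moreover, for any $a, b$ we have 

$$0 \le \ell_\epsilon(a, b) - \tilde{\ell}_\beta(a, b) \le \frac{1}{2\beta} \qquad (1)$$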


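Analysis of Forward Greedy Selection: We need to set up some notation before we can describe the 
guarantees given for the predictor learned using the Forward Greedy Selection algorithm. Consider 
a domain $\mathcal{X} \subset \mathbb{R}^d$ for some $d > 0$ and the class of functions 
$\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_1 \le C_W\}$. 
For any distribution $\mathcal{D}$ on $\mathcal{X}$ and any predictor from $\mathcal{F}$, define 
$\mathcal{R}_{\mathcal{D}}(w) := \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon\left(\langle w, x \rangle, y(x)\right)\right]$ 
and $\tilde{\mathcal{R}}_{\mathcal{D}}(w) := \mathbb{E}_{x \sim \mathcal{D}}\left[\tilde{\ell}_\beta\left(\langle w, x \rangle, y(x)\right)\right]$. 
Also let $\bar{w}$ be the minimizer of the following program 

$$\bar{w} = \arg\min_{w : \|w\|_1 \le C_W}\tilde{\mathcal{R}}_{\mathcal{D}}(w) \qquad (2)$$

Then [10, Theorem 2.4], when specialized to our case, guarantees that Algorithm 2, when executed 
with $\tilde{\ell}_\beta(\cdot, \cdot)$ as the loss function and error tolerance $\epsilon_2$, produces a $k$-sparse predictor 
$\hat{w}$, with $k$ as set in Step 1 of Algorithm 2 and $\|\hat{w}\|_1 \le C_W$, such that 

$$\tilde{\mathcal{R}}_{\mathcal{D}}(\hat{w}) - \tilde{\mathcal{R}}_{\mathcal{D}}(\bar{w}) \le \epsilon_2$$

Thus, if we can show the existence of a good predictor in our space with bounded $L_1$ norm then this 
would upper bound the loss incurred by the minimizer of Equation 2 and using [10, Theorem 2.4] we 
would be done. Note that Theorem 8 does indeed give us such a guarantee which allows us to make 
the following argument: we are guaranteed the existence of a predictor $\tilde{f}$ with $L_1$ norm bounded 
by $B$ that has $\epsilon$-insensitive loss bounded by $(\epsilon_0 + \epsilon_1)$. Thus if we take $C_W = B$ in Equation 2 and 
use the left inequality of Equation 1, we get $\tilde{\mathcal{R}}_{\mathcal{D}}(\bar{w}) \le \epsilon_0 + \epsilon_1$. Thus we have 
$\tilde{\mathcal{R}}_{\mathcal{D}}(\hat{w}) \le \epsilon_0 + \epsilon_1 + \epsilon_2$. 
Using Equation 1 (right inequality) with $\beta = \frac{1}{\epsilon_2}$, we get 
$\mathcal{R}_{\mathcal{D}}(\hat{w}) \le \epsilon_0 + \epsilon_1 + 3\epsilon_2/2$. 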

However it is not possible to give utility guarantees with bounded sample complexities using the 
above analysis, the reason being that Algorithm 2 requires us to calculate, for any given vector $w$, the 
vector $\nabla_w\tilde{\mathcal{R}}(w) = \mathbb{E}_{x \sim \mathcal{D}}\left[\frac{\partial}{\partial w}\tilde{\ell}_\beta\left(\langle w, x \rangle, y(x)\right)\right]$, 
which is infeasible to calculate for a distribution 
with infinite support since it requires unbounded sample complexities. To remedy this we shall, as 
suggested by [10], take $\mathcal{D}$ not to be the true distribution over the entire domain $\mathcal{X}$, but rather the 
empirical distribution $\mathcal{D}_{\text{emp}} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{\{x = x_i\}}$ for a given sample of training 
points $x_1, \ldots, x_n$. Note 
that the result in [10] holds for any distribution which allows us to proceed as before. 

Notice however, that we are yet again faced with the challenge of proving an upper bound on the 
loss incurred by the minimizer of Equation 2. This we do as follows: the predictor $\tilde{f}$ defined in 
Theorem 8 has expected $\epsilon$-insensitive loss over the entire domain bounded by $\epsilon_0 + \epsilon_1$. Hence it will, 
with probability greater than $(1 - \delta)$, have at most $\epsilon_0 + \epsilon_1 + O\left(\frac{1}{\sqrt{n}}\right)$ loss on a 
random sample of $n$ points by an application of Hoeffding's inequality. Thus we have 
$\tilde{\mathcal{R}}_{\mathcal{D}_{\text{emp}}}(\bar{w}) \le \epsilon_0 + \epsilon_1 + O\left(\frac{1}{\sqrt{n}}\right)$ with high probability. 

The main difference in this analysis shall be that the guarantee on $\hat{w}$ we get will be on its training 
loss rather than its true loss, i.e. we will have 
$\tilde{\mathcal{R}}_{\mathcal{D}_{\text{emp}}}(\hat{w}) \le \epsilon_0 + \epsilon_1 + O\left(\frac{1}{\sqrt{n}}\right) + \epsilon_2$. However, 
since Algorithm 2 guarantees $\|\hat{w}\|_1 \le C_W = B$, we can still hope to bound its generalization error. 
More specifically, Lemma 22, given below, shows that with probability greater than $(1 - \delta)$ over the 
choice of training points we will have, for all $w \in \mathbb{R}^d$ with $\|w\|_1 \le B$, 
$\tilde{\mathcal{R}}_{\mathcal{D}}(w) - \tilde{\mathcal{R}}_{\mathcal{D}_{\text{emp}}}(w) \le \tilde{O}\left(\frac{1}{\sqrt{n}}\right)$ where the 
$\tilde{O}(\cdot)$ notation hides certain log factors. 

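Lemma 22 (Risk bounds for sparse linear predictors [13]). Consider a real-valued prediction 
problem $y$ over a domain $\mathcal{X} = \{x : \|x\|_\infty \le C_X\} \subset \mathbb{R}^d$ and a linear learning model 
$\mathcal{F} : \{x \mapsto \langle w, x \rangle : \|w\|_0 \le k, \|w\|_1 \le C_W\}$ under some fixed loss function 
$\ell(\cdot, \cdot)$ that is $C_L$-Lipschitz in its second argument. For any $f \in \mathcal{F}$, let 
$\mathcal{L}_f = \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(f(x), y(x))\right]$ and $\hat{\mathcal{L}}^n_f$ be the empirical loss 
on a set of $n$ i.i.d. chosen points. Then we have, with probability greater than $(1 - \delta)$, 

$$\sup_{f \in \mathcal{F}}\left(\mathcal{L}_f - \hat{\mathcal{L}}^n_f\right) \le 2C_L C_X C_W\sqrt{\frac{2\log(2d)}{n}} + C_L C_X C_W\sqrt{\frac{\log(1/\delta)}{2n}}.$$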

Proof. The result for non-sparse vectors, which applies here as well, follows in a straightforward 
manner from [13, Theorem 1, Example 3.1(2)] and [23], which we reproduce for completeness. Since the 
$L_1$ and $L_\infty$ norms are dual to each other, for any $w \in \left(\mathbb{R}^+\right)^d$ such that $\|w\|_1 = B$ and any 
$p \in \Delta^d$, where $\Delta^d$ is the probability simplex in $d$ dimensions, the Kullback-Leibler divergence function 
$\text{KL}\left(\frac{w}{B} \,\middle\|\, p\right)$ is $\frac{1}{B^2}$-strongly convex with respect to the $L_1$ norm. We can remove the 
positivity constraints on the coordinates of $w$ by using the standard method of introducing additional 
dimensions that encode negative components of the (signed) weight vector. 

Using [13, Theorem 1], thus, we can bound the Rademacher complexity of the function class $\mathcal{F}$ as 
$\mathcal{R}_n(\mathcal{F}) \le C_X C_W\sqrt{\frac{2\log(2d)}{n}}$. Next, using the Lipschitz properties of the loss function, 
a result from [23] allows us to bound the excess error by 
$2C_L\mathcal{R}_n(\mathcal{F}) + C_L C_X C_W\sqrt{\frac{\log(1/\delta)}{2n}}$. The result then 
follows. □ 

Thus, by applying a union bound, with probability at least $(1 - 2\delta)$, we will choose a training set 
such that $\tilde{f}$, and consequently $\bar{w}$, has bounded loss on that set, and such that the uniform convergence 
guarantee of Lemma 22 holds as well. Then we can bound the true loss of the predictor returned by 
Algorithm 2 as 

$$\tilde{\mathcal{R}}_{\mathcal{D}}(\hat{w}) \le \tilde{\mathcal{R}}_{\mathcal{D}_{\text{emp}}}(\hat{w}) + \tilde{O}\left(\frac{1}{\sqrt{n}}\right) 
\le \epsilon_0 + \epsilon_1 + \epsilon_2 + \tilde{O}\left(\frac{1}{\sqrt{n}}\right)$$

where the first inequality uses the uniform convergence guarantee and the second inequality holds 
conditional on $\tilde{f}$ having bounded loss on a given training set. The final guarantee is formally given 
in Theorem 9. 



Note that using Lemma 16 here would at best guarantee a decay of $O\left(\sqrt{d/n}\right)$. Transferring $\epsilon$-insensitive loss to absolute loss requires an addition of $\epsilon$. Using all the results given above, we can now give a proof for Theorem 9 which we restate for convenience.
Theorem 23 (Theorem 9 restated). Every similarity function that is $(\epsilon_0, B, \tau)$-good for a regression problem with respect to the insensitive loss function $\ell_\epsilon(\cdot, \cdot)$ is $(\epsilon_0 + \epsilon)$-useful with respect to absolute

loss as well; with the dimensionality of the landmarked space being bounded by $O\left(\frac{B^2}{\tau\epsilon_1^2}\log\frac{1}{\delta}\right)$ and the labeled sample complexity being bounded by $O\left(\frac{B^2}{\epsilon_1^2}\log\frac{B}{\epsilon_1\delta}\right)$. Moreover, this utility can be achieved by an $O(\tau)$-sparse predictor on the landmarked space.



Proof. Using Theorem |sj we first bound the excess loss due to landmarking by 32B\J y ° g ^ T ^ _ 
Next we set up the (dummy) Ivanov regularized regression problem (given in Equation (2)) with the training loss being the objective and regularization parameter $C_W = B$. The training loss incurred by the minimizer of that problem $w_{\text{inter}}$ is, with probability at least $(1 - \delta)$, bounded by

£ (Winter) < e + 325y /l ^r 5 ^ + B \J l0Si n 5) due t0 the guarantees of Theorem]^ Next, we 
run the Forward Greedy Selection algorithm of ifTOl (specialized to our case in Algorithm [2]i and 
obtain another predictor w with L\ norm bounded by B that has empirical error at most C (w) < 

£ (winte, ) + \ ^jf~ • Finally ^ using Lemma 22 we bound the true e-insensitive loss incurred by w 



by $\hat{\mathcal{L}}(w) + 2B\sqrt{\frac{2\log(2d)}{n}} + B\sqrt{\frac{\log(1/\delta)}{2n}}$. Adding $\epsilon$ to convert this loss to absolute loss we get that with probability at least $(1 - 3\delta)$, we will output a $k$-sparse predictor in a $d$-dimensional space with absolute regression loss at most



We note that Forward Greedy Selection gives $O\left(\frac{1}{k}\right)$ error rates, which are much better, if the loss function being used is smooth. This can be achieved by using squared loss $\ell_{sq}(a, b) = (a - b)^2$ as the surrogate. However we note that assuming goodness of the similarity function in terms of squared loss would impose strictly stronger conditions on the learning problem. This is because $\mathbb{E}\left[\ell_{sq}(a, b)\right] \le \sup|a - b| \cdot \mathbb{E}\left[|a - b|\right]$ and thus, under boundedness conditions, squared loss is bounded by a constant times the absolute loss but it is not possible to bound absolute loss (or $\epsilon$-insensitive loss) as a constant multiple of the squared loss since there exist distributions such that

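$\mathbb{E}\left[|a - b|\right] = \Omega\left(\frac{1}{\sup|a - b|} \cdot \mathbb{E}\left[\ell_{sq}(a, b)\right]\right)$ and $\frac{1}{\sup|a - b|}$ can diverge.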
Below we prove admissibility results for the sparse learning model. 



D.2 Proof of TheoremQl 



To prove the first part, construct a new weight function w(x) = sign (u>(x)) • w. Note that we have 
|u)(x)| < w < B. Also construct the choice function as follows: for any x, let P [i?(x) = l|x] = 



Mx)| 



This gives us E [i?(x)J = ^. Then for any x, we have 



E J»(x')f(x,x')|fi(x') 



E 



E 
E 

x'~X> 



sign (w(x)) wK(x, x 

»(x)if(x,x')|/( 
W (x')A'(x,x')] 



,>(x)| 



B 



I [i?(x) = 1] 



Since $f(x) = \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right]$ has small $\epsilon$-insensitive loss by $(\epsilon_0, B)$-goodness of $K$, we have

our result. To prove the second part, construct a new weight function w(x) = ii^-P [i?(x) = l|x]. 
Note that we have |w)(x)| < — , Then for any x, we have

\w(x!) 



E [*(x')X(x,x')] = E 



E 

x'~X> 



r 



iJ(x / )if(x,x' 
X(x,x')|i*(x') 



» p [i?(x') = l] 



E [w(x')if(x,x')|i?(x') 

x'~X> 



Since $f(x) = \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x') \mid R(x')\right]$ has small $\epsilon$-insensitive loss by $(\epsilon_0, B, \tau)$-goodness of $K$, we have our result.

Using the above result we get our admissibility guarantee.

Corollary 24. Every PSD kernel that is $(\epsilon_0, \gamma)$-good for a regression problem is, for any $\epsilon_1 > 0$, $\left(\epsilon_0 + \epsilon_1, O\left(\frac{1}{\epsilon_1\gamma^2}\right), 1\right)$-good as a similarity function as well.

The above result is rather weak with respect to the sparsity parameter $\tau$ since we have made no assumptions on the distribution of the dual variables $\alpha_i, \alpha_i^*$ in the proof of Theorem 6 which is why we are forced to use the (weak) inequality $\tau \le 1$. Any stronger assumptions on the kernel goodness shall also strengthen this admissibility result.

Appendix E Ordinal Regression 

In this section we give missing utility and admissibility proofs for the similarity-based learning 
model for ordinal regression. But before we present the analysis of our model, we give below, an 
analysis of algorithms that choose to directly reduce the ordinal regression problem to real-valued 
regression. The analysis will serve as motivation that will help us define our goodness criteria. 

E.1 Reductions to real-valued regression

One of the simplest learning algorithms for the problem of ordinal regression involves a reduction to real-valued regression [17] [16] where we modify our goal to that of learning a real-valued function $f$ which we then threshold using a set of thresholds $\{b_i\}_{i=1}^{r}$ with $b_1 = -\infty$ to get discrete labels as shown below

$$y_f(x) = \arg\max_{i \in [r]} \{b_i : f(x) \ge b_i\}$$

These thresholds may themselves be learned or fixed a priori. A simple choice for these thresholds is $b_i = i - \frac{1}{2}$ for $i > 1$. It is easy to show (using a result in [17]) that for the fixed thresholds specified above, we have for all $f : \mathcal{X} \to \mathbb{R}$,

$$\ell_{ord}(y_f(x), y(x)) \le \min\left\{2|f(x) - y(x)|, |f(x) - y(x)| + \frac{1}{2}\right\} \le \min\left\{2\ell_\epsilon(f(x) - y(x)) + 2\epsilon, \ell_\epsilon(f(x) - y(x)) + \epsilon + \frac{1}{2}\right\}$$

where in the last step we use the fact that $|x| - \epsilon \le \ell_\epsilon(x) \le |x|$.
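The reduction can be sketched in a few lines of Python. This is only an illustrative check, assuming the fixed thresholds are $b_1 = -\infty$ and $b_i = i - \frac{1}{2}$ for $i > 1$ (that is, rounding to the nearest label) and that the ordinal loss is the absolute difference of labels; the function names and the grid of test values are hypothetical.

```python
import math

def threshold_label(f_value, r):
    """Label in {1, ..., r}: the largest i whose threshold b_i is at most f_value.

    Assumed thresholds: b_1 = -inf and b_i = i - 1/2 for i > 1.
    """
    thresholds = [-math.inf] + [i - 0.5 for i in range(2, r + 1)]
    return max(i for i, b in enumerate(thresholds, start=1) if f_value >= b)

def eps_insensitive(x, eps):
    """epsilon-insensitive loss of a residual x."""
    return max(abs(x) - eps, 0.0)

def check_reduction_bounds(r=5, eps=0.1, tol=1e-9):
    """Return True if both inequalities hold on a grid of predictions and labels."""
    for step in range(-200, 801):
        f_value = step / 100.0
        for y in range(1, r + 1):
            ord_loss = abs(threshold_label(f_value, r) - y)
            residual = f_value - y
            bound_abs = min(2 * abs(residual), abs(residual) + 0.5)
            l_eps = eps_insensitive(residual, eps)
            bound_eps = min(2 * l_eps + 2 * eps, l_eps + eps + 0.5)
            if ord_loss > bound_abs + tol or bound_abs > bound_eps + tol:
                return False
    return True
```

Calling `check_reduction_bounds()` walks over predictions both inside and outside the label range, so it also exercises the clipping effect of $b_1 = -\infty$ and of the top label.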

It is tempting to use this reduction along with guarantees given for real-valued regression to directly give generalization bounds for ordinal regression. To pursue this further, we need a notion of a good similarity function which we give below:

Definition 25. A similarity function $K$ is said to be $(\epsilon_0, B)$-good for an ordinal regression problem $y : \mathcal{X} \to [r]$ if for some bounded weight function $w : \mathcal{X} \to [-B, B]$, the following predictor, when subjected to fixed thresholds, has expected ordinal regression error at most $\epsilon_0$

$$f : x \mapsto \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right]$$

i.e. $\mathbb{E}_{x \sim \mathcal{D}}\left[|y_f(x) - y(x)|\right] \le \epsilon_0$.
From the definition of the thresholding scheme used to define $y_f$ from $f$, it is clear that $|f(x) - y(x)| \le |y_f(x) - y(x)| + \frac{1}{2}$. Since we have $\ell_\epsilon(x) \le |x|$ for any $\epsilon \ge 0$, we have $\ell_\epsilon(f(x) - y(x)) \le |y(x) - y_f(x)| + \frac{1}{2}$ and thus we have

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_\epsilon(f(x), y(x))\right] \le \epsilon_0 + \frac{1}{2}.$$



Thus, starting with a goodness guarantee of the similarity function with respect to ordinal regression, we obtain a guarantee of the goodness of the similarity function $K$ with respect to real-valued regression that satisfies the requirements of Theorem 5. Thus we have the existence of a linear predictor over a low dimensional space with $\epsilon$-insensitive error at most $\epsilon_0 + \frac{1}{2} + \epsilon_1$. We can now argue (using results from [17]) that this real-valued predictor, when subjected to the fixed thresholds, would yield a predictor with ordinal regression error at most

$$\min\left\{1 + 2\epsilon_0 + 2\epsilon_1 + 2\epsilon, \; \epsilon + 1 + \epsilon_0 + \epsilon_1\right\}$$

However, this is rather disappointing since this implies that the resulting predictor would, on an average, give out labels that are at least one step away from the true label. This forms the intuition behind introducing (soft) margins in the goodness formulation that gives us Definition 12. Below we give proofs for utility and admissibility guarantees for our model for similarity-based ordinal regression.



E.2 Proof of Theorem 13

We use Lemma 15 to construct a landmarked space with a linear predictor $\tilde{f} : x \mapsto \langle w, x \rangle$ such that $\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right] \le 2\epsilon_1$. As before, we have $\|w\|_2 \le B$ and $\sup\{\|\Psi(x)\|\} \le 1$. In the following, we shall first show bounds on the mislabeling error i.e. $\mathbb{P}_{x \sim \mathcal{D}}\left[\hat{y}(x) \ne y(x)\right]$. Next, we shall convert these bounds into ordinal regression loss by introducing a spacing parameter into the model.



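Since the $\gamma$-margin loss function is 1-Lipschitz, we get

$$\left[\tilde{f}(\Psi(x)) - b_{y(x)}\right]_\gamma \le \left[f(x) - b_{y(x)}\right]_\gamma + 2\epsilon_1$$

$$\left[b_{y(x)+1} - \tilde{f}(\Psi(x))\right]_\gamma \le \left[b_{y(x)+1} - f(x)\right]_\gamma + 2\epsilon_1$$

which gives us, upon taking expectations on both sides,

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[\tilde{f}(\Psi(x)) - b_{y(x)}\right]_\gamma + \left[b_{y(x)+1} - \tilde{f}(\Psi(x))\right]_\gamma\right] \le \epsilon_0 + 4\epsilon_1$$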



Lemma 15 guarantees the excess loss due to landmarking to be at most $64B\sqrt{\frac{\log(1/\delta)}{d}}$. Moreover, since the $\gamma$-margin loss is 1-Lipschitz, Lemma 16 allows us to bound the excess loss due to training so that the learned predictor has $\gamma$-margin loss at most $\epsilon_0 + \epsilon_1$ for any $\epsilon_1$ given large enough $d$ and $n$. Now, from the definition of the $\gamma$-margin loss it is clear that if the loss is greater than $\gamma$ then it indicates a mislabeling. Hence, the mislabeling error is bounded by $\frac{\epsilon_0 + \epsilon_1}{\gamma}$.

This may be unsatisfactory if $\gamma \ll 1$; to remedy such situations we show that we can bound the 1-margin loss directly. Starting from $\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(x)) - f(x)\right|\right] \le 2\epsilon_1$, we can also deduce

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[1 - \tilde{f}(\Psi(x)) + b_{y(x)}\right]_+ + \left[1 - b_{y(x)+1} + \tilde{f}(\Psi(x))\right]_+\right] \le \epsilon_0 + 4\epsilon_1$$

We can bound the excess training error for this loss function as well. Since the 1-margin loss directly bounds the mislabeling error, combining the two arguments we get the second part of the claim.

However, the margin losses themselves do not present any bound on the ordinal regression error. This is because, if the thresholds are closely spaced together, then even an instance of gross ordinal regression loss could correspond to very small margin loss. To remedy this, we introduce a spacing parameter into the model. We say that a set of thresholds is $\Delta$-spaced if $\min_{i \in [r]}\{|b_i - b_{i+1}|\} \ge \Delta$. Such a condition can easily be incorporated into the model of [17] as a constraint in the optimization formulation.

Suppose that a given instance has ordinal regression error $\ell_{ord}(\hat{y}(x), y(x)) = k$. This can happen if the point was given a label $k$ labels below (or above) its correct label. Also suppose that the $\gamma$-margin error in this case is $[\hat{y}(x) - y(x)]_\gamma = h$. Without loss of generality, assume that the point $x$ of label $k + 1$ was given the label 1 giving an ordinal regression loss of $\ell_{ord} = k$ (a similar analysis would hold if the point of label 1 were to be given a label $k + 1$ by symmetry of the margin loss formulation with respect to left and right thresholds). In this case the value of the underlying regression function must lie between $b_1$ and $b_2$ and thus, the margin loss $h$ satisfies $h \ge b_{k+1} + \gamma - b_2 = \gamma + \sum_{i=2}^{k}(b_{i+1} - b_i) \ge \gamma + (k - 1)\Delta$. Thus, if the margin loss is at most $h$, the ordinal regression error must satisfy $\ell_{ord}(\hat{y}(x), y(x)) \le \frac{h - \gamma}{\Delta} + 1$.

Let $\psi_\Delta(x) = \frac{x + \Delta - 1}{\Delta}$. Using the bounds on the $\gamma$-margin and 1-margin losses given above, we get the first part of the claim.

In particular, a constraint of $\Delta = 1$ put into an optimization framework ensures that the bounds on mislabeling loss and ordinal regression loss match since $\psi_1(x) = x$ for all $x$. In general, the cases where the above framework yields a non-trivial bound for the mislabeling error rate, i.e. $\ell_{0\text{-}1} < 1$ (which can always be ensured if $\epsilon_0 < 1$ by taking large enough $d$ and $n$), also correspond to those where the ordinal regression error rate is also bounded above by 1 since $\sup_{x \in [0,1], \Delta > 0} \psi_\Delta(x) = 1$.
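The role of the spacing parameter can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it reads the $\gamma$-margin loss of a point with true label $y$ as $[\gamma - (f - b_y)]_+ + [\gamma - (b_{y+1} - f)]_+$ with $b_1 = -\infty$ and $b_{r+1} = +\infty$, draws random $\Delta$-spaced thresholds, and verifies that a label error of $k \ge 1$ steps forces a margin loss of at least $\gamma + (k - 1)\Delta$. All names and parameter values are hypothetical.

```python
import math
import random

def predicted_label(f_value, thresholds):
    """thresholds[i - 1] holds b_i (b_1 = -inf); returns the largest i with f_value >= b_i."""
    return max(i for i, b in enumerate(thresholds, start=1) if f_value >= b)

def margin_loss(f_value, y, thresholds, gamma):
    """Two-sided gamma-margin loss around the true label y (b_{r+1} = +inf)."""
    b = thresholds + [math.inf]
    left = max(gamma - (f_value - b[y - 1]), 0.0)
    right = max(gamma - (b[y] - f_value), 0.0)
    return left + right

def spacing_bound_holds(r=6, delta=0.7, gamma=0.3, trials=2000, seed=0):
    """Check h >= gamma + (k - 1) * delta on random Delta-spaced instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Finite thresholds b_2 < ... < b_r with consecutive gaps of at least delta.
        finite = [0.0]
        for _ in range(r - 2):
            finite.append(finite[-1] + delta + rng.random())
        thresholds = [-math.inf] + finite
        f_value = rng.uniform(finite[0] - 3.0, finite[-1] + 3.0)
        y = rng.randint(1, r)
        k = abs(predicted_label(f_value, thresholds) - y)
        h = margin_loss(f_value, y, thresholds, gamma)
        if k >= 1 and h < gamma + (k - 1) * delta - 1e-12:
            return False
    return True
```

Rearranging the verified inequality gives $k \le \frac{h - \gamma}{\Delta} + 1$, which is why closely spaced thresholds (small $\Delta$) weaken the translation from margin loss to ordinal regression loss.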



E.3 Admissibility Guarantees 



We begin by giving the kernel goodness criterion which we adapt from existing literature on large margin approaches to ordinal regression. More specifically we use the framework described in [16] for which generalization guarantees are given in [17].

Definition 26. Call a PSD kernel $K$ $(\epsilon_0, \gamma)$-good for an ordinal regression problem $y : \mathcal{X} \to [r]$ if there exists $W^* \in \mathcal{H}_K$, $\|W^*\| = 1$ and a fixed set of thresholds $\{b_i\}_{i=1}^{r}$ such that

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[1 - \frac{\langle W^*, \Phi_K(x)\rangle}{\gamma} + b_{y(x)}\right]_+ + \left[\frac{\langle W^*, \Phi_K(x)\rangle}{\gamma} - b_{y(x)+1} + 1\right]_+\right] \le \epsilon_0$$

The above definition exactly corresponds to the EXC formulation put forward by [17] except for the fact that during actual optimization, a strict ordering on the thresholds is imposed explicitly. [17] present yet another model called IMC which does not impose any explicit orderings, rather the ordering emerges out of the minimization process itself. Our model can be easily extended to the IMC formulation as well.



Theorem 27 (Theorem 14 restated). Every PSD kernel that is $(\epsilon_0, \gamma)$-good for an ordinal regression problem is also $\left(\gamma_1\epsilon_0 + \epsilon_1, O\left(\frac{\gamma_1^2}{\epsilon_1\gamma^2}\right)\right)$-good as a similarity function with respect to the $\gamma_1$-margin loss for any $\gamma_1, \epsilon_1 > 0$. Moreover, for any $\epsilon_1 < \gamma_1/2$, there exists an ordinal regression instance and a corresponding kernel that is $(0, \gamma)$-good for the ordinal regression problem but only $(\epsilon_1, B)$-good as a similarity function with respect to the $\gamma_1$-margin loss function for $B = \Omega\left(\frac{\gamma_1^2}{\epsilon_1\gamma^2}\right)$.


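Proof. We prove the two parts of the result separately.
Part 1: Admissibility: As before, using Lemma 17 it is possible to obtain a vector $W' = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)\Phi_K(x_i) \in \mathcal{H}_K$ such that $0 \le \alpha_i, \alpha_i^* \le p_i C$ (by applying the KKT conditions) and the following holds:

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[b_{y(x)} + 1 - \langle W', \Phi_K(x)\rangle\right]_+ + \left[\langle W', \Phi_K(x)\rangle - b_{y(x)+1} + 1\right]_+\right] \le \frac{1}{2C\gamma^2} + \epsilon_0 \qquad (3)$$

This allows us to construct a weight function $w_i = \frac{\alpha_i - \alpha_i^*}{p_i}$ such that $|w_i| \le 2C$ (since we do not have any guarantee that $\alpha_i\alpha_i^* = 0$) and $\mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right] = \langle W', \Phi_K(x)\rangle$ for all $x \in \mathcal{X}$.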
Denoting $f(x) := \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x, x')\right]$ for convenience gives us

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[f(x) - b_{y(x)}\right]_1 + \left[b_{y(x)+1} - f(x)\right]_1\right] = \mathbb{E}_{x \sim \mathcal{D}}\left[\left[1 - f(x) + b_{y(x)}\right]_+ + \left[1 - b_{y(x)+1} + f(x)\right]_+\right] \le \frac{1}{2C\gamma^2} + \epsilon_0$$

where in the first step we used $[x]_1 = [1 - x]_+$. Now use the fact $[x]_1 = \frac{1}{\gamma_1}[\gamma_1 x]_{\gamma_1}$ to get the following:

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left[\gamma_1 f(x) - \gamma_1 b_{y(x)}\right]_{\gamma_1} + \left[\gamma_1 b_{y(x)+1} - \gamma_1 f(x)\right]_{\gamma_1}\right] \le \frac{\gamma_1}{2C\gamma^2} + \gamma_1\epsilon_0$$

Note that it is not possible to perform the analysis on the loss function $[\cdot]_{\gamma_1}$ directly since using it requires us to scale the threshold values by a factor of $\gamma_1$ that makes the result in Equation (3) unusable. Hence we first perform the analysis for $[\cdot]_1$, utilize Equation (3) and then interpret the resulting inequality in terms of $[\cdot]_{\gamma_1}$.

Setting $2C = \frac{\gamma_1}{\epsilon_1\gamma^2}$, using $w'(x) = \gamma_1 w(x)$ as weights, using $b'_j = \gamma_1 b_j$ as the thresholds and noting that the new bound on the weights is $|w'_i| \le 2C\gamma_1$ gives us the result. As before, using variational optimization techniques, this result can be extended to non-discrete distributions as well. □
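The rescaling step used in this argument rests on a one-line identity between margin losses at different scales. A minimal Python check, assuming the $\gamma$-margin loss is read as the hinge $[x]_\gamma = \max(\gamma - x, 0)$ (the helper names are hypothetical):

```python
def margin(x, gamma):
    """gamma-margin loss [x]_gamma, read here as the hinge max(gamma - x, 0)."""
    return max(gamma - x, 0.0)

def rescaled(x, gamma_1):
    """Right-hand side of the identity: (1 / gamma_1) * [gamma_1 * x]_{gamma_1}."""
    return margin(gamma_1 * x, gamma_1) / gamma_1

# Largest discrepancy between [x]_1 and its rescaled form over a small grid.
max_gap = max(
    abs(margin(x, 1.0) - rescaled(x, g))
    for x in (-2.0, -0.5, 0.0, 0.3, 1.0, 1.7)
    for g in (0.05, 0.25, 1.0, 4.0)
)
```

The identity holds because $\max(\gamma_1 - \gamma_1 x, 0) = \gamma_1 \max(1 - x, 0)$ for $\gamma_1 > 0$, which is what lets the bound for $[\cdot]_1$ be reinterpreted for $[\cdot]_{\gamma_1}$ after scaling the weights and thresholds by $\gamma_1$.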

In particular, setting $\gamma_1 = \gamma$ gives us that any PSD kernel that is $(\epsilon_0, \gamma)$-good for an ordinal regression problem is also $\left(\gamma\epsilon_0 + \epsilon_1, O\left(\frac{1}{\epsilon_1}\right)\right)$-good as a similarity function with respect to the $\gamma$-margin loss.

Part 2: Tightness: We adapt our running example (used for proving the lower bound for real regression) for the case of ordinal regression as well. Consider the points with value $-1$ as having label 1 and those having value $+1$ as having label 2. Clearly, $w = (1, 0, 0)$ along with the thresholds $b_1 = -\infty$ and $b_2 = 0$ establishes the native inner product as a $(0, \gamma)$-good PSD kernel.

Now consider the heavy points yet again and some weight function and threshold $b_2$ ($b_1$ is always fixed at $-\infty$) that is supposed to demonstrate the goodness of the inner product kernel as a similarity function. Clearly we have



E 

x~X> 



[/(x) - &„( X )] T1 + [&»(x)+l - /( X )] 71 



> 



> 



([/( Xl )-& 2 ] 7i + [& 2 -/(x4)] 

([71 ^ + b 2 } + + [ 7l -b 2 + /(x 4 )] + ) 



(2 7l - /( Xl ) + /(x 4 )) 
1 



271 - ( g - £ ) (1 - b ) («>4 - W l) 



where in the third step we have used the fact that [a] 
error at most e\, we must have 



(271- Q- £ ) ( 4 ^ 2 ) K-^i) 

[b] , > a + b. Thus, in order to have expected 



— u>i > 



1 

4^2 



271 - T 



7i 2 
4ei 7 2 



by setting e 



^ which then proves the result after applying an averaging argument. 



-1 



Appendix F Ranking 



The problem of ranking stems from the need to sort a set of items based on their relevance. In the model considered here, each ranking instance is composed of $m$ documents (pages) $(p_1, \ldots, p_m)$ from some universe $\mathcal{P}$ along with their relevance to some particular query $q \in \mathcal{Q}$ that are given as relevance scores from some set $\mathcal{R} \subset \mathbb{R}$. Thus we have $\mathcal{X} = \mathcal{Q} \times \mathcal{P}^m$ with each instance $x \in \mathcal{X}$ being provided with a relevance vector $r(x) \in \mathcal{R}^m$. Let the $i$-th query-document pair of a ranking instance $x$ be denoted by $z_i \in \mathcal{Q} \times \mathcal{P}$. For any $z = (p, q) \in \mathcal{P} \times \mathcal{Q}$, let $r(z) \in \mathcal{R}$ denote the true relevance of document $p$ to query $q$.

For any relevance vector $r \in \mathcal{R}^m$, let $\bar{r}$ be the vector with elements of $r$ sorted in descending order and $\pi_r$ be the permutation that this sorting induces. For any permutation $\pi$, $\pi(i)$ shall denote the index given to the index $i$ under $\pi$. Although the desired output of a ranking problem is a permutation, we shall follow the standard simplification [27] of requiring the output to be yet another relevance vector $s$ with the permutation $\pi_s$ being considered as the actual output. This converts the ranking problem into a vector-valued regression problem.

We will take the true loss function $\ell_{actual}(\cdot, \cdot)$ to be the popular NDCG loss function [28] defined below

$$\ell_{NDCG}(s, r) = -\frac{1}{\|G(r)\|_D}\sum_{i=1}^{m}\frac{G(r(i))}{F(\pi_s(i))}$$

where $\|r\|_D = \max_{\pi \in S_m}\sum_{i=1}^{m}\frac{r(i)}{F(\pi(i))}$, $G(r) = 2^r - 1$ is the growth function and $F(t) = \log(1 + t)$ is the decay function.
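For concreteness, here is a minimal NumPy sketch of an NDCG-style loss. It assumes the loss is the negated discounted cumulative gain of the ordering induced by the scores $s$, normalized by the best achievable value over all permutations, with $G(r) = 2^r - 1$ and $F(t) = \log(1 + t)$; the function names are hypothetical and ties in $s$ are broken by index.

```python
import numpy as np

def growth(r):
    """Growth function G(r) = 2^r - 1, applied coordinate-wise."""
    return 2.0 ** np.asarray(r, dtype=float) - 1.0

def decay(t):
    """Decay function F(t) = log(1 + t)."""
    return np.log(1.0 + np.asarray(t, dtype=float))

def positions_from_scores(s):
    """pi_s(i): 1-based position of item i when items are sorted by descending score."""
    order = np.argsort(-np.asarray(s, dtype=float), kind="stable")
    positions = np.empty(len(order), dtype=int)
    positions[order] = np.arange(1, len(order) + 1)
    return positions

def ndcg_loss(s, r):
    """Negated, normalized discounted cumulative gain of the ranking induced by s."""
    gains = growth(r)
    achieved = np.sum(gains / decay(positions_from_scores(s)))
    # The maximum over permutations pairs the largest gains with the smallest decay.
    ideal = np.sum(np.sort(gains)[::-1] / decay(np.arange(1, len(gains) + 1)))
    return -achieved / ideal
```

With this convention a perfect ordering attains the minimum value $-1$, and any other ordering gives a value strictly between $-1$ and $0$ whenever the relevance scores are positive and distinct.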

For the surrogate loss functions $\ell_K$ and $\ell_S$ we shall use the squared loss function $\ell_{sq}(s, r) = \|s - r\|_2^2$. We shall overload notation to use $\ell_{sq}(\cdot, \cdot)$ upon reals as well. For any vector $r \in \mathcal{R}^m$, let $\eta(r) := \frac{G(r)}{\|G(r)\|_D}$ and let $\eta(r)_i$ denote its $i$-th coordinate.

Due to the decomposable nature of the surrogate loss function, we shall require kernels and similarity functions to act over query-document pairs i.e. $K : (\mathcal{P} \times \mathcal{Q}) \times (\mathcal{P} \times \mathcal{Q}) \to \mathbb{R}$. This also coincides with a common feature extraction methodology (see for example [27] [29]) where every query-document pair is processed to yield a feature vector. Consequently, all our goodness definitions shall loosely correspond to the ability of a kernel/similarity to accurately predict the true relevance scores for a given query-document pair. We shall assume ranking instances to be generated by the sampling of a query $q \sim \mathcal{D}_\mathcal{Q}$ followed by $m$ independent samples of documents from the (conditional) distribution $\mathcal{D}_{\mathcal{P}|q}$. The distribution over ranking instances is then a product distribution $\mathcal{D} = \mathcal{D}_\mathcal{X} = \mathcal{D}_\mathcal{Q} \times \underbrace{\mathcal{D}_{\mathcal{P}|q} \times \mathcal{D}_{\mathcal{P}|q} \times \cdots \times \mathcal{D}_{\mathcal{P}|q}}_{m \text{ times}}$. A key consequence of this generative mechanism is that the $i$-th query-document pair of a random ranking instance, for any fixed $i$, is a random query-document instance selected from the distribution $\mu := \mathcal{D}_\mathcal{Q} \times \mathcal{D}_{\mathcal{P}|q}$.

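Definition 28. A similarity function $K$ is said to be $(\epsilon_0, B)$-good for a ranking problem $y : \mathcal{X} \to S_m$ if for some bounded weight function $w : \mathcal{P} \times \mathcal{Q} \to [-B, B]$, for any ranking instance $x = (q, p_1, p_2, \ldots, p_m)$, if we define $f : \mathcal{X} \to \mathbb{R}^m$ as

$$f_i := \mathbb{E}_{z \sim \mu}\left[w(z)K(z_i, z)\right]$$

where $z_i = (p_i, q)$, then we have $\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}(f(x), \eta(r(x)))\right] \le \epsilon_0$.

Definition 29. A PSD kernel $K$ is said to be $(\epsilon_0, \gamma)$-good for a ranking problem $y : \mathcal{X} \to S_m$ if there exists $W^* \in \mathcal{H}_K$, $\|W^*\| = 1$ such that, for any ranking instance $x = (q, p_1, p_2, \ldots, p_m)$, when we define, for any $W \in \mathcal{H}_K$, $f(\cdot; W) : \mathcal{X} \to \mathbb{R}^m$ as

$$f_i(x; W) = \frac{\langle W, \Phi_K(z_i)\rangle}{\gamma}$$

where $f_i$ is the $i$-th coordinate of the output of $f$ and $z_i = (p_i, q)$, then we have $\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}(f(x; W^*), \eta(r(x)))\right] \le \epsilon_0$.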

The choice of this surrogate is motivated by consistency considerations. We would ideally like a minimizer of the surrogate loss to have bounded actual loss as well. Using results from [27], it can be shown that the above defined surrogate is not only consistent, but that excess loss in terms of this surrogate can be transferred to excess loss in terms of $\ell_{NDCG}(\cdot, \cdot)$, a very desirable property. Although [27] shows this to be true for a whole family of surrogates, we chose $\ell_{sq}(\cdot, \cdot)$ for its simplicity. All our utility arguments carry forward to other surrogates defined in [27] with minimal changes.

We move on to prove utility guarantees for the given similarity learning model. 

Theorem 30. Every similarity function that is $(\epsilon_0, B)$-good for a ranking problem for $m$-documents with respect to squared loss is $O\left(\frac{\sqrt{m}}{\log m} \cdot \sqrt{\epsilon_0}\right)$-useful with respect to NDCG loss.



Proof. As before, we use Lemma 15 to construct a landmarked space with a linear predictor $\tilde{f} : x \mapsto \langle w, x \rangle$ such that $\mathbb{E}_{z \sim \mu}\left[\left|\tilde{f}(\Psi(z)) - f(z)\right|\right] \le 2\epsilon_1$. We have $\|w\|_2 \le B$ and $\sup\{\|\Psi(x)\|\} \le 1$.

Now let us overload notation to denote by $\Psi(x)$ the concatenation of the images of the $m$ document-query pairs in $x$ under $\Psi(\cdot)$ and by $\tilde{f}(\Psi(x))$, the $m$-dimensional vector obtained by applying $\tilde{f}$ to each of the $m$ components of $\Psi(x)$.
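Since the squared loss function is $2B$-Lipschitz in its first argument in the region of interest, we get

$$\begin{aligned}
\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(\tilde{f}(\Psi(x)), \eta(r(x))\right)\right]
&= \mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{i=1}^{m}\ell_{sq}\left(\tilde{f}(\Psi(z_i)), \eta(r(x))_i\right)\right] \\
&= \sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(\tilde{f}(\Psi(z_i)), \eta(r(x))_i\right)\right] \\
&= \sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(z_i), \eta(r(x))_i\right)\right] + \sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(\tilde{f}(\Psi(z_i)), \eta(r(x))_i\right) - \ell_{sq}\left(f(z_i), \eta(r(x))_i\right)\right] \\
&\le \sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(z_i), \eta(r(x))_i\right)\right] + 2B\sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\tilde{f}(\Psi(z_i)) - f(z_i)\right|\right] \\
&= \sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(z_i), \eta(r(x))_i\right)\right] + 2B\sum_{i=1}^{m}\mathbb{E}_{z \sim \mu}\left[\left|\tilde{f}(\Psi(z)) - f(z)\right|\right] \\
&\le \sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(z_i), \eta(r(x))_i\right)\right] + 4Bm\epsilon_1 \\
&= \mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{i=1}^{m}\ell_{sq}\left(f(z_i), \eta(r(x))_i\right)\right] + 4Bm\epsilon_1 \\
&= \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(x), \eta(r(x))\right)\right] + 4Bm\epsilon_1 \\
&\le \epsilon_0 + 4Bm\epsilon_1
\end{aligned}$$

where $x = (q, p_1, \ldots, p_m)$ and $z_i = (p_i, q)$. In the first and the last but one step we have used decomposability of the squared loss, in the fourth step we have used Lipschitz properties of the squared loss, in the fifth step we have used properties of the generative mechanism assumed for ranking instances, in the sixth step we have used the guarantee given by Lemma 15. Throughout we have repeatedly used linearity of expectation. This bounds the excess error due to landmarking to $d$ dimensions by $64B^2m^2\sqrt{\frac{\log(1/\delta)}{d}}$ using Lemma 15. Similarly, Lemma 16 also allows us to bound the excess error due to training by $3B^2\sqrt{\frac{\log(1/\delta)}{n}}$ which puts our total squared loss at $\epsilon_0 + \epsilon_1$ for large enough $d$ and $n$.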



We now invoke [27, Theorem 10] that states that if the surrogate loss function $\ell(\cdot, \cdot)$ being used is a Bregman divergence generated by a function that is $C_S$-strongly convex with respect to some norm $\|\cdot\|$ then we can bound $\ell_{NDCG}(s, r) \le \frac{C_F}{\sqrt{C_S}} \cdot \sqrt{\ell(s, r)}$ where $C_F = 2\left\|\left(\frac{1}{F(1)}, \ldots, \frac{1}{F(m)}\right)\right\|_*$, $F$ is the decay function used in the definition of NDCG and $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. Note that we are using the "noiseless" version of the result where $r(x)$ is a deterministic function of $x$.

In our case the squared loss is 2-strongly convex with respect to the $L_2$ norm which is its own dual. Hence $C_S = 2$ and $C_F = O\left(\frac{\sqrt{m}}{\log m}\right)$. If $\tilde{f} : x \mapsto \langle w, \Psi(x) \rangle$ is our final output, we get, for some constant $C$,



E 

x~X> 



'NDCG 



e + ABmex < C 



log m 



2m 



which proves the claim. This affects the bounds given by Lemmata 15 and 16 since the dependence of the excess error on $d$ and $n$ will now be in terms of the inverse of their fourth roots instead of inverse of the square roots as was the case in regression and ordinal regression. □



We note that the (rather heavy) dependence of the final utility guarantee (that is $O\left(\sqrt{m\epsilon_0}\right)$) on $m$
is because the decay function $F(t) = \log(1 + t)$ chosen here (which seems to be a standard in literature but with little theoretical justification) is a very slowly growing function (it might sound a bit incongruous to have an increasing function as our decay function; however, since this function appears in the denominator in the definition of NDCG, it effectively induces a decay). Using decay functions that grow super-linearly (or rather those that induce super-linear decays), we can ensure $O\left(\sqrt{\epsilon_0}\right)$-usefulness since in those cases, $C_F = O(1)$.
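The effect of the decay function on the constant $C_F$ is easy to see numerically. The sketch below assumes $C_F = 2\left\|\left(\frac{1}{F(1)}, \ldots, \frac{1}{F(m)}\right)\right\|_2$ (the $L_2$ norm being its own dual) and compares the logarithmic decay with a quadratic decay $F(t) = t^2$, which is a hypothetical choice used only to illustrate a super-linearly growing $F$.

```python
import numpy as np

def decay_constant(decay_fn, m):
    """C_F = 2 * ||(1/F(1), ..., 1/F(m))||_2 for a given decay function F."""
    t = np.arange(1, m + 1, dtype=float)
    return 2.0 * np.linalg.norm(1.0 / decay_fn(t))

def log_decay(t):
    """The decay F(t) = log(1 + t) discussed in the text."""
    return np.log(1.0 + t)

def quadratic_decay(t):
    """Hypothetical super-linear decay F(t) = t^2."""
    return t ** 2

log_small = decay_constant(log_decay, 100)
log_large = decay_constant(log_decay, 10_000)
quad_small = decay_constant(quadratic_decay, 100)
quad_large = decay_constant(quadratic_decay, 10_000)
```

Under the logarithmic decay the constant keeps growing with the number of documents $m$, whereas under the quadratic decay it is essentially unchanged between $m = 100$ and $m = 10{,}000$, in line with the $C_F = O(1)$ remark.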

We next prove admissibility bounds for the ranking problem. The learning setting as well as the 
proof is different for ranking (due to presence of multiple entities in a single ranking instance), 
hence we shall provide all the arguments for completeness. 

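Theorem 31. Every PSD kernel that is $(\epsilon_0, \gamma)$-good for a ranking problem is also $\left(\epsilon_0 + \epsilon_1, O\left(\frac{m\sqrt{m}}{\epsilon_1\sqrt{\epsilon_1}\gamma^3}\right)\right)$-good as a similarity function for any $\epsilon_1 > 0$.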



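Proof. For notational convenience, we shall assume that the RKHS $\mathcal{H}_K$ is finite dimensional so that we can talk in terms of finite dimensional matrices and vectors. As before, let $f(z; W) = \langle W, \Phi_K(z) \rangle$ and let $W'$ be the minimizer of the following program.

$$\begin{aligned}
&\min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + C\,\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(x; W), \eta(r(x))\right)\right] \\
={}& \min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + C\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{i=1}^{m}\ell_{sq}\left(f(z_i; W), \eta(r(x))_i\right)\right] \\
={}& \min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + C\sum_{i=1}^{m}\mathbb{E}_{x \sim \mathcal{D}}\left[\ell_{sq}\left(f(z_i; W), \eta(r(x))_i\right)\right] \\
={}& \min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + mC\,\mathbb{E}_{z \sim \mu}\left[\ell_{sq}\left(f(z; W), \tilde{r}(z)\right)\right] + C_\mathcal{D}
\end{aligned}$$

where for any $z \in \mathcal{Q} \times \mathcal{P}$, $\tilde{r}(z)$ gives us the expected normalized relevance of this document-query pair across ranking instances and $C_\mathcal{D}$ is some constant independent of $W$ and dependent solely on the underlying distributions. Using the goodness of the kernel $K$ and the argument given in the proof of Lemma 17 it is possible to show that the vector $W'$ has squared loss at most $\frac{1}{2C\gamma^2} + \epsilon_0$. Hence the only task remaining is to show that there exists a bounded weight function $w$ such that for all $z \in \mathcal{P} \times \mathcal{Q}$, we have $f(z; W') = \langle W', \Phi_K(z) \rangle = \mathbb{E}_{z' \sim \mu}\left[w(z')K(z, z')\right]$ which will prove the claim.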
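To do so we assume that the (finite) set of document-query pairs is $(z_1, \ldots, z_k)$ with $z_i$ having probability $\mu_i$ and relevance $\tilde{r}_i = \tilde{r}(z_i)$. Then the above program can equivalently be written as

$$\begin{aligned}
&\min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + mC\sum_{i=1}^{k}\mu_i\left(\langle W, \Phi_K(z_i)\rangle - \tilde{r}_i\right)^2 \\
={}& \min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + mC\left\|\sqrt{P}X^\top W - \sqrt{P}\tilde{r}\right\|_2^2 \\
={}& \min_{W \in \mathcal{H}_K} \frac{1}{2}\|W\|^2_{\mathcal{H}_K} + mC\left\|\tilde{X}^\top W - \hat{r}\right\|_2^2 \\
={}& \min_{\alpha \in \mathbb{R}^k} \frac{1}{2}\|X\alpha\|^2_{\mathcal{H}_K} + mC\left\|\tilde{X}^\top X\alpha - \hat{r}\right\|_2^2
\end{aligned}$$

where $X = (\Phi_K(z_1), \ldots, \Phi_K(z_k))$, $\tilde{r} = (\tilde{r}_1, \ldots, \tilde{r}_k)^\top$, $P$ is the $k \times k$ diagonal matrix with $P_{ii} = \mu_i$, $\tilde{X} = X\sqrt{P}$ and $\hat{r} = \sqrt{P}\tilde{r}$. The last step follows by the Representer Theorem which tells us that at the optima, $W = X\alpha$ for some $\alpha \in \mathbb{R}^k$.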

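Some simple linear algebra shows us that the minimizer $\alpha$ has the form

$$\begin{aligned}
\alpha &= \left(X^\top\tilde{X}\tilde{X}^\top X + \frac{1}{2mC}X^\top X\right)^{-1}X^\top\tilde{X}\hat{r} \\
&= \left(GPG + \frac{G}{2mC}\right)^{-1}GP\tilde{r} \\
&= \left(PG + \frac{I}{2mC}\right)^{-1}G^{-1}GP\tilde{r} \\
&= \left(PG + \frac{I}{2mC}\right)^{-1}P\tilde{r}
\end{aligned}$$

where $G = X^\top X$ is the Gram matrix given by the kernel $K$. In the third step we have assumed that $G$ does not have vanishing eigenvalues which can always be ensured by adding a small positive constant to the diagonal. Thus we have

$$\left(PG + \frac{I}{2mC}\right)\alpha = P\tilde{r}$$

looking at the $i$-th element of both sides we have

$$\mu_i\langle W', \Phi_K(z_i)\rangle + \frac{\alpha_i}{2mC} = \mu_i\tilde{r}_i$$

which gives us $\alpha_i = 2mC\mu_i\left(\tilde{r}_i - \langle W', \Phi_K(z_i)\rangle\right)$. Now assume, without loss of generality, that the relevance scores are normalized, i.e. $\tilde{r}_i \le 1$ for all $i$. Thus we have

$$\frac{1}{2}\|W'\|^2_{\mathcal{H}_K} + mC\left\|\tilde{X}^\top W' - \hat{r}\right\|_2^2 \le \frac{1}{2}\|0\|^2_{\mathcal{H}_K} + mC\left\|\tilde{X}^\top 0 - \hat{r}\right\|_2^2$$

which gives us $\frac{1}{2}\|W'\|^2_{\mathcal{H}_K} \le mC\|\hat{r}\|_2^2 \le mC\sum_{i}\mu_i = mC$ which gives us $\|W'\| \le \sqrt{2mC}$.

Since the kernel is already a normalized kernel, $\|\Phi_K(z_i)\| \le 1$ which gives us, by an application of Cauchy-Schwarz, $|\alpha_i| \le 2mC\mu_i\left(1 + \sqrt{2mC}\right) \le 6\mu_i mC\sqrt{mC}$.

If we now establish a weight function over the domain $w_i = \frac{\alpha_i}{\mu_i}$ then $|w_i| \le 6mC\sqrt{mC}$ and we can show that for all $z$, we have $\langle W', \Phi_K(z)\rangle = \mathbb{E}_{z' \sim \mu}\left[w(z')K(z, z')\right]$. Setting $C = \frac{1}{2\epsilon_1\gamma^2}$ finishes the proof. □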
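The closed form for $\alpha$ and the resulting weight function can be sanity-checked on a small synthetic instance. The NumPy sketch below is illustrative only: the sizes, the constant $C$, and the random features standing in for $\Phi_K(z_i)$ are all hypothetical, and the feature dimension is taken larger than $k$ so that the Gram matrix has no vanishing eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim, m, C = 6, 8, 3, 2.0            # hypothetical sizes and constant C

X = rng.normal(size=(dim, k))          # columns stand in for Phi_K(z_i)
X /= np.linalg.norm(X, axis=0)         # normalized kernel: unit-norm features
mu = rng.uniform(0.5, 1.5, size=k)
mu /= mu.sum()                         # probabilities mu_i of the k pairs
r = rng.uniform(0.0, 1.0, size=k)      # normalized relevances, each at most 1

P = np.diag(mu)
G = X.T @ X                            # Gram matrix given by the kernel
alpha = np.linalg.solve(P @ G + np.eye(k) / (2 * m * C), P @ r)
W = X @ alpha                          # W = X alpha, as in the Representer Theorem

# Gradient of (1/2)||W||^2 + mC * sum_i mu_i (<W, Phi_i> - r_i)^2 evaluated at W.
grad = W + 2 * m * C * X @ (mu * (X.T @ W - r))

w = alpha / mu                         # weight function w_i = alpha_i / mu_i
# E_{z' ~ mu}[w(z') K(z_j, z')] for every j, to be compared with <W, Phi_K(z_j)>.
prediction_from_weights = G @ (mu * w)
```

A vanishing gradient confirms that $W = X\alpha$ is the minimizer of the regularized program, and comparing `prediction_from_weights` with `X.T @ W` confirms that the weight function reproduces the kernel predictor on every pair.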



Appendix G Supplementary Experimental Results 

Below we present additional experimental results for regression and ordinal regression problems. 
(a) Mean squared error for landmarking (RegLand), sparse landmarking (RegLand-Sp) and kernel regression (KR) for the Gaussian kernel
(b) Mean squared error for landmarking (RegLand), sparse landmarking (RegLand-Sp) and kernel regression (KR) for the Euclidean kernel
(c) Avg. absolute error for landmarking (ORLand) and kernel regression (KR) on ordinal regression datasets for the Manhattan kernel
(d) Avg. absolute error for landmarking (ORLand) and kernel regression (KR) on ordinal regression datasets for the Gaussian kernel

Figure 2: Performance of landmarking algorithms with increasing number of landmarks on real regression (Figures 2a and 2b) and ordinal regression datasets (Figures 2c and 2d) for various kernels.



G.1 Regression Experiments

We present results on various benchmark datasets considered in Section 4 for Gaussian $K(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$ and Euclidean $K(x, y) = -\|x - y\|_2^2$ kernels. Following standard practice, we fixed $\sigma$ to be the average pairwise distance between data points in the training set.



G.2 Ordinal Regression Experiments

We present results on various benchmark datasets considered in Section 4 for Gaussian $K(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$ and Manhattan $K(x, y) = -\|x - y\|_1$ kernels.
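The similarity functions used in these experiments are simple to compute. The following NumPy sketch assumes the Gaussian kernel takes the form $\exp\left(-\|x - y\|_2^2 / (2\sigma^2)\right)$ with $\sigma$ set to the average pairwise distance of the training points, and the Manhattan similarity is $-\|x - y\|_1$; the helper names are hypothetical and this is not the authors' experimental code.

```python
import numpy as np

def pairwise_l2(A, B):
    """Matrix of Euclidean distances between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def average_pairwise_distance(X):
    """Mean Euclidean distance over all unordered pairs of distinct rows of X."""
    distances = pairwise_l2(X, X)
    return distances[np.triu_indices(len(X), k=1)].mean()

def gaussian_kernel(A, B, sigma):
    """Assumed form: K(x, y) = exp(-||x - y||_2^2 / (2 sigma^2))."""
    return np.exp(-pairwise_l2(A, B) ** 2 / (2.0 * sigma ** 2))

def manhattan_kernel(A, B):
    """Non-PSD similarity K(x, y) = -||x - y||_1."""
    return -np.sum(np.abs(A[:, None, :] - B[None, :, :]), axis=-1)
```

For example, `gaussian_kernel(X_train, landmarks, average_pairwise_distance(X_train))` would give the similarity of every training point to a set of landmark points, where `X_train` and `landmarks` are placeholder arrays with one point per row.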






